Low power memory controller subsystem IP exploration using RTL power flow

An End-to-end power analysis and reduction Methodology

NEERAJNAYAN BALACHANDRAN

DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND LEVEL

STOCKHOLM, SWEDEN 2019

KTH ROYAL INSTITUTE OF TECHNOLOGY

ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Low power memory controller subsystem IP exploration using RTL power flow

An End-to-end power analysis and reduction methodology

Neerajnayan Balachandran

2020-06-25

Master’s Thesis

Examiner

Prof. Ahmed Hemani

Academic adviser

Dimitrios Stathis

Industrial adviser

Ioannis Savvidis

KTH Royal Institute of Technology

School of Electrical Engineering and Computer Science (EECS)

Department of Electrical Engineering

SE-100 44 Stockholm, Sweden


Abstract

FinFET-based Application Specific Integrated Circuit (ASIC) designs deliver on the promises of scalability, performance, and power, but the road ahead is bumpy with technical challenges in building efficient ASICs. Designers can no longer rely on the ‘auto-scaling’ power reduction that follows technology node scaling, now that 7nm presents itself as a ‘long-lived’ node. This creates a need for early power analysis and reduction flows that are incorporated into the ASIC Intellectual Property (IP) design flow, so that designs are power-efficient in addition to being functionally efficient. Power-inefficiency hotspots are the leading causes of chip re-spins, and a guideline methodology for designing blocks in a power-efficient manner leads to power-efficient Integrated Circuits (ICs). This in turn reduces cooling requirements and cost. The Common Memory Controller is one of the leading consumers of power in the ASIC designs at Ericsson. This Thesis focuses on developing a power analysis and reduction flow for the Common Memory Controller by connecting the verification environment of the block to low-level power analysis tools and using motivated test cases to collect power metrics. This serves the two main goals of the Thesis: characterization and optimization of the block for power. The work also includes an energy efficiency perspective through the Differential Energy Analysis technique, initiated by Qualcomm and Ansys, which improves the flow by improving the test cases that help uncover power inefficiencies/bugs and therefore optimize the block. The flow developed in the Thesis fulfills the goals of characterizing and optimizing the block. The characterization data is presented to give an idea of the type of data that can be collected and that is useful for SoC architects and designers in planning future designs. The characterization/profiling data collected from the blocks collectively contributes to the Electronic System-level power analysis that helps correlate the ASIC power estimate to silicon. The work also validates the flow by working on a specific sub-block: identifying possible power bugs, modifying the design, and validating the improved performance.

Keywords

Power analysis, Block characterization, Optimization, Differential Energy analysis, Dynamic Power, Clock gating.


Sammanfattning

With FinFET-based Application Specific Integrated Circuit (ASIC) designs delivering on the promises of scalability, performance, and power, the road ahead is bumpy with technical challenges in building efficient ASICs. Designers can no longer rely on the "auto-scaling" power reduction that follows technology node scaling, in these times when 7nm presents itself as a "long-lived" node. This leads to the need for early power analysis and reduction flows that are integrated into the ASIC Intellectual Property (IP) design flow. This leads to a focus on power-efficient design in addition to functional efficiency. Power-efficiency-related hotspots are the leading causes of chip re-spins, and a guideline methodology for designing blocks in a power-efficient manner leads to power-efficient design of Integrated Circuits (ICs). This eases the cooling requirements and the cost. The Common Memory Controller is one of the leading consumers of power in the ASIC designs at Ericsson. This thesis focuses on developing a power analysis and reduction flow for the Common Memory Controller by connecting the verification environment of the block to low-level power analysis tools, using motivated test cases to collect power metrics, which leads to the two main goals of the thesis: characterization and optimization of the block for power. This work also includes an energy efficiency perspective through the Differential Energy Analysis technique, initiated by Qualcomm and Ansys, to improve the flow by improving the test cases that help uncover power inefficiencies/bugs and therefore optimize the block. The flow developed in the thesis fulfills the goals of characterizing and optimizing the block. The characterization data is presented to give an idea of the type of data that can be collected and be useful for SoC architects and designers in planning future designs. The characterization/profiling data collected from the blocks together contribute to the Electronic System-level power analysis that helps correlate the ASIC power estimate to silicon. The work also validates the flow by working on a specific sub-block, identifying possible power bugs, modifying the design, and validating improved performance, thereby validating the flow.

Nyckelord

Power analysis, Block characterization, Optimization, Differential Energy analysis, Dynamic Power, Clock gating


Acknowledgments

I would like to thank my supervisor at Ericsson, Ioannis Savvidis, for his continuous and motivating support and for providing the right preliminary platform to get straight into the thesis. All my technical discussions with him have contributed significantly to my knowledge and understanding of the thesis and the area of work. I would like to thank Prof. Ahmed Hemani for his valuable insights during the different stages and corresponding presentations of the Thesis progress. Every checkpoint with Prof. Hemani has helped me guide the eventual progress of the thesis. I would like to thank Pierre Rohdin G, manager at Ericsson, for his continuous support in setting up the whole environment for the thesis and pointing me to the right resources to get the work going. Thanks also to the Infrastructure team at Ericsson, Mikael Carlsson, Jonas Nyman, Haoge Liu, and Suleiman Abukharmeh, for their timely input and feedback that helped drive different parts of the thesis. Thanks to my supervisor at KTH, Dimitrios Stathis, for his continuous support through review meetings, feedback, and pointers for improvement.

Stockholm, May 2020

Neerajnayan Balachandran


Table of contents

Abstract
    Keywords
Sammanfattning
    Nyckelord
Acknowledgments
Table of contents
List of Figures
List of Tables
List of acronyms and abbreviations
1 Introduction
    1.1 Background
    1.2 Problem
    1.3 Purpose
    1.4 Goals
    1.5 Research Methodology
    1.6 Delimitations
    1.7 Structure of the Thesis
2 Background
    2.1 Power consumption in ASICs
        2.1.1 Dynamic Power
        2.1.2 Static Power
        2.1.3 Power Bugs
    2.2 Software tools used – Introduction and relevance
        2.2.1 ActivityExplorer (VCD2RPT++ and VCD2TB)
        2.2.2 Spyglass Power
        2.2.3 PrimeTimePX
    2.3 Dynamic Power reduction and Clock gating
        2.3.1 Clock Gating
        2.3.2 Clock gating performance metrics
    2.4 Framework development environment
        2.4.1 Common Memory Controller
        2.4.2 Performance Analysis framework
        2.4.3 IP Design flow – Introducing a Power analysis flow
3 Power analysis and reduction framework development methodology
    3.1 Power Test cases
        3.1.1 Power test case/Stimuli knobs
        3.1.2 Activity Analysis – Test case characterization and understanding
        3.1.3 Differential Energy Analysis – Test case tuning for optimization
    3.2 RTL based early Power Analysis and Optimization
        3.2.1 Characterization of the block
        3.2.2 Analysis and Optimization flow
    3.3 Netlist based power analysis
        3.3.1 VCD2TB – Netlist simulation using Dump, Convert, Replay
        3.3.2 PrimeTimePX – Gate level sign-off power estimation
    3.4 Cache block analysis – A sample analysis & optimization
        3.4.1 Problem
        3.4.2 Solution and validation
    3.5 Summary – An end-to-end power analysis and reduction flow
4 Results and analysis
    4.1 Characterization results for Common Memory Controller (CMC)
    4.2 Power analysis results for optimization
5 Conclusion and Future
References


List of Figures

Figure 1.1: Synopsys Global User survey results presented at DAC 2019
Figure 2.1: Understanding switching power in CMOS switch
Figure 2.2: ActivityExplorer GUI
Figure 2.3: Simplified Activity Analysis flow
Figure 2.4: Gate level replay simulation and activity analysis flow
Figure 2.5: Spyglass Power analysis flow - goals and steps
Figure 2.6: PrimeTimePX Analysis Flow
Figure 2.7: Widely used techniques for Power reduction and control [14]
Figure 2.8: Power savings and accuracy attainable at different levels of abstraction [14]
Figure 2.9: Possible synthesis of clock gating
Figure 2.10: Understanding SCGE
Figure 2.11: Understanding DCGE
Figure 2.12: Understanding ROADF
Figure 2.13: Understanding ROADE
Figure 2.14: A generic ASIC IP Design methodology [16]
Figure 2.15: Early Power perspective to IP design flow
Figure 3.1: Activity Profile of a typical use-case for the block
Figure 3.2: Visualizing port exclusivity knob using VCD2RPT++
Figure 3.3: Simplified view of test case tuning
Figure 3.4: Understanding Differential Energy Analysis - uncovering inefficiencies, Image from [17]
Figure 3.5: Inferences based on the scenarios of energy difference between the two test cases [17]
Figure 3.6: Differential Energy Analysis Implementation flow
Figure 3.7: Differential Energy Analysis - Flow sequence for redundancy localization
Figure 3.8: Analysis and problem localization using Differential Energy Analysis
Figure 3.9: Purpose-to-use case/operating point correlation to utilize the flow for ASIC IP blocks
Figure 3.10: A representation of power-based operating points of an ASIC IP Block
Figure 3.11: Dynamic power optimization implementation methodology
Figure 3.12: Using VCD2TB for gate-level simulation
Figure 3.13: Methodology of identifying power inefficiency in CMC - VCD2RPT++ screenshot
Figure 3.14: FIFO RTL without Clock gating for comparison
Figure 3.15: FIFO RTL with Clock Gating
Figure 3.16: Validation and signoff analysis of RTL improvements
Figure 3.17: Power analysis and Reduction Flow Summary
Figure 4.1: Correlation between activity and loading condition of the block
Figure 4.2: Variation of the average clock gating efficiency with increased loading of the block
Figure 4.3: Power split-up relationship with respect to incremental loading of block
Figure 4.4: Power bug location in a block using Power analysis Flow


List of Tables

Table 1: Operating point vs profiling metrics data for CMC
Table 2: Metric comparison between the modified and original RTL at Cache level
Table 3: Metric comparison of metrics between modified and original RTL for ACK_FIFOs


List of acronyms and abbreviations

FinFET Fin Field Effect Transistor

ASIC Application Specific Integrated Circuit

RTL Register Transfer Level

IP Intellectual property

DAC Design Automation Conference

CMC Common Memory Controller

CSV Comma Separated Values

CMOS Complementary Metal Oxide Semiconductor

VCD Value Change Dump

ASCII American Standard Code for Information Interchange

GUI Graphical User Interface

FSDB Fast Signal Database

SPEF Standard Parasitic Exchange Format

SDC Synopsys Design Constraint

I/O Input/output

SAIF Switching Activity Interchange format

FSM Finite State Machine

VHDL Very High-Speed Integrated Circuit Hardware Description Language

ICGS Integrated Clock Gating Cell

SCGE Static Clock Gating Efficiency

DCGE Dynamic Clock Gating Efficiency

CG Clock Gating

ROADF Register Output Activity Density for Flops

ROADE Register Output Activity Density for Enable

UVM Universal Verification Methodology

DSP Digital Signal Processor

STA Static Timing Analysis

DC Design Compiler – Synthesis tool from Synopsys

CT Clock Tree

SOT Start of Transaction

SoC System on Chip

IC Integrated Circuit

ESL Electronic System-level

DUT Design under test

CGE Clock Gating Efficiency

ACK Acknowledgment

FIFO First-In-First-Out


1 Introduction

As technology and design sophistication in the field of IC design increase with time, several buzzwords of the recent past have become norms that critically enable the way forward for today’s electronic design industry. Power and energy efficiency are such buzzwords: they have gained significant focus and emphasis and are being incorporated into the mandates of ASIC IP design flows. Semiconductor process technology limitations and the requirements of ubiquitous, battery-enabled applications such as IoT and Edge computing hardware necessitate systematic power analysis and optimization in design cycles.

This Thesis introduces, implements, and validates a systematic power analysis and reduction flow that can be introduced early in the ASIC IP design flow. The work is performed at the ASIC department at Ericsson, Kista, and uses an internal ASIC IP block that is a significant consumer of power for the analysis. It aims to serve as a precursor to treating power as a major consideration, alongside performance, area, and cost, in the ASIC design flow. This is the way forward to designing improved future ASICs.

1.1 Background

Designing ASICs with a focus on power efficiency is considered a critical bottleneck in today’s designs, with limited tools and methodologies that merge seamlessly into the ASIC design flow [1, 2]. Power efficiency has gained focus in mainstream and consumer electronics design as the way forward. For a long time, chipmakers could rely on improving technology nodes, which scaled transistor specifications at each node; this was a major advantage that almost negated the need for power optimization flows in ASIC design. This auto-scaling benefit of moving to improved technology nodes no longer applies today [3]. The cost and complexity of designs skyrocket as technology nodes scale. Also, 7nm is anticipated to be a long-standing node in ASIC designs [4]. Hence, today’s designs and all designs going forward will need a strong focus on power efficiency throughout the ASIC design flow, starting as early as the architecture choices. The existing trend of a myopic focus on performance, cost, and area as the only design concerns is no longer feasible.

Power dissipation in a digital circuit can be categorized into two major types: static and dynamic power [5]. Current designs scaling below 28nm are enabled by cutting-edge FinFET technology, which has numerous advantages over conventional planar CMOS technology: very low leakage power (which contributes to a low static power component) and higher drive current per transistor footprint, and hence higher speed [6]. The low static power dissipation provides a good basis for working on the dynamic power of the design. Dynamic power depends on the stimulus to, or the usage scenarios of, the design [7]. Improving it therefore requires a utilization-based power analysis and optimization flow that can characterize a design block for power in a specific usage scenario and then evaluate the scope for power optimization.
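To make the stimulus dependence concrete, the following minimal sketch uses the standard first-order CMOS switching-power model, P = alpha * C * Vdd^2 * f, where the activity factor alpha is set by the usage scenario. The function name `switching_power` and every numeric value are illustrative assumptions; none of them are measurements of the CMC or of any Ericsson design.

```python
def switching_power(alpha, c_eff_farads, vdd_volts, f_clk_hz):
    """First-order switching power in watts: alpha * C * Vdd^2 * f.

    alpha is the activity factor, i.e. the average fraction of the
    effective capacitance that is charged per clock cycle. It is the
    only term here that depends on the stimulus applied to the design.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * c_eff_farads * vdd_volts ** 2 * f_clk_hz


# Illustrative values only (assumptions, not data from a real block).
C_EFF = 1e-9   # 1 nF effective switched capacitance
VDD = 0.8      # supply voltage in volts
F_CLK = 1e9    # 1 GHz clock frequency

# Same hardware, two hypothetical usage scenarios.
idle_power = switching_power(0.02, C_EFF, VDD, F_CLK)
busy_power = switching_power(0.20, C_EFF, VDD, F_CLK)
print(f"idle: {idle_power:.4f} W, busy: {busy_power:.4f} W")
```

With capacitance, voltage, and frequency held fixed, the tenfold difference in activity factor translates directly into a tenfold difference in switching power, which is why a power estimate is only meaningful together with the test case that produced it.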

The power analysis and reduction flow is developed and validated on the Common Memory Controller (CMC) IP block at Ericsson. The CMC enables the sharing of memory resources between several units such as accelerators, DSPs, and interface blocks. It is one of the prime consumers of power in the ASIC designs, which makes it a good candidate for analysis, characterization, and optimization. The flow uses power analysis tools that support the development of a framework for bringing the power perspective into efficient IP design. This Thesis work uses Ericsson-internal tools such as ActivityExplorer for RTL and netlist analysis, and commercial tools such as Spyglass Power from Synopsys [8] and PrimeTimePX, the signoff power analysis tool from Synopsys [9].


This Thesis deals with early power estimation and the implementation of low-power techniques for a design block. According to Synopsys, as presented at DAC 2019, RTL power estimation and the implementation of low-power techniques are the two difficult challenges that design teams face in their design flows. This Thesis also focuses on clock gating as the low-power technique for improving dynamic power performance. According to the Synopsys Global User Survey 2016, presented by Synopsys at DAC 2019, clock gating is the most widely used power reduction technique among design teams (almost 80% of designers use it). It can therefore be argued that this Thesis addresses the most pressing challenges using the most prevalent low-power technique relevant to the requirements of the industry.
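As a rough illustration of why clock gating reduces dynamic power, the sketch below counts how many clock edges reach a single register with and without gating, given a per-cycle enable trace. The function `clock_activity` and the trace are hypothetical and deliberately simplified; this is not the definition of the clock gating metrics used later in the Thesis (Section 2.3.2).

```python
def clock_activity(enable_trace):
    """Return (ungated_edges, gated_edges, fraction_saved) for one register.

    enable_trace holds one 0/1 value per clock cycle. A 1 means the
    register must capture new data in that cycle; a 0 means its value
    is held and the clock edge does no useful work.
    """
    if not enable_trace:
        raise ValueError("enable_trace must not be empty")
    # Without gating, the register sees a clock edge in every cycle.
    ungated_edges = len(enable_trace)
    # With gating, the clock is passed on only in enabled cycles.
    gated_edges = sum(1 for enable in enable_trace if enable)
    fraction_saved = 1.0 - gated_edges / ungated_edges
    return ungated_edges, gated_edges, fraction_saved


# Hypothetical trace: the register is written in 2 out of 8 cycles.
trace = [1, 0, 0, 0, 1, 0, 0, 0]
ungated, gated, saved = clock_activity(trace)
print(f"{gated} of {ungated} clock edges remain, {saved:.0%} suppressed")
```

The suppressed edges are the cycles in which the clock pin of the register, and the clock network feeding it, would otherwise toggle without changing any stored value.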

1.2 Problem

As ASIC designers can no longer rely on the power reduction that came automatically with the scaling of technology nodes, power analysis and optimization flows need to be developed as part of the ASIC IP design flow. Power-inefficient designs can lead to unreliable systems that dissipate too much heat, which reduces the lifetime of chips [10]. This becomes a concern given the high density of transistors in designs and the high computational demands placed on them. Power-efficient designs also benefit the environment through a lower energy footprint. This Thesis tries to provide better clarity on including a power-dissipation perspective during IP development and helps optimize designs. The flow helps design teams understand their designs better by characterizing and optimizing them, and helps them foresee and improve the power efficiency of upcoming designs early in the design process.

Figure 1.1: Synopsys Global User survey results presented at DAC 2019


The Common Memory Controller (CMC), a key IP block in both the baseband and radio ASICs at Ericsson, is one of the largest consumers of power in the designs. This motivates implementing a characterization and optimization flow for the CMC hierarchical top and its sub-blocks.

This Thesis addresses the need for a power analysis and reduction flow for the CMC that connects the existing block verification environment to the power analysis tools and lays the foundation for extending similar flows to other ASIC sub-blocks, leading to power-efficient designs.

1.3 Purpose

This Thesis aims to develop, implement, and validate a power analysis and reduction flow for the Common Memory Controller (CMC) ASIC IP block. The flow must be a stimulus-based power analysis that uses the power analysis tools, because use-case-based analysis provides accurate estimates and better pointers to improvement. The flow aims to provide a systematic method to gather power-related metrics of the design, analyze the data, and optimize the design for power efficiency. Such a flow on the CMC shall provide the baseline for further analyses and optimization of ASIC design blocks at Ericsson.

This Thesis will lead to a flow that helps produce better designs in terms of power. This leads to energy efficiency, lower heat dissipation, a lower carbon and energy footprint, longer component lifetimes, and hence lower utilization of resources. From the sustainability perspective, every small amount of power this flow saves accumulates over the hundreds of instances where these designs are used, adding up in small steps to significant energy savings and energy-efficient designs.

1.4 Goals

This Thesis work aims to develop a methodology to study the power performance of an ASIC design starting from RTL, by understanding and characterizing the Common Memory Controller (CMC) and eventually probing for scope for optimization in power performance. The whole methodology needs to be systematized using in-house and commercial tools to form a baseline procedure that can be extended to analyze other design blocks at Ericsson. This goal can be broken down sequentially as follows.

1. Implement a power analysis and reduction flow for the CMC block that connects the existing verification environment to the power analysis tools.

2. Extract power metrics transparently using in-house and commercial front-end power tools, facilitating quick power exploration and profiling. Characterize the block on the basis of the extracted metrics to enable performance interpolation for future design decisions in IP teams.

3. Profile the subsystem to pinpoint potential power improvements in different workload scenarios. Trim the results to a shortlist of prime candidate modules, realize the RTL changes, and demonstrate the achievable power savings and area/timing tradeoffs.

This work will lead to a power analysis and reduction flow that uses standard power analysis tools, connects to the existing verification environments of design blocks, works on test cases of various load scenarios for accurate characterization, and uses other test cases to simulate and uncover power bugs in the design, leading to optimization of the design.

Implemented on the CMC, this flow will lead to the characterization of the block in terms of power metrics. It also provides a hierarchical breakdown of possible improvement areas in the design. The work also validates the flow based on these findings by analyzing the power of the block after improving the design in RTL.


1.5 Research Methodology

The implementation of this Thesis can be split into several subsections that are executed methodically in overlapping timelines to achieve the goals of the project within the overall duration.

The execution of this Thesis project begins with obtaining a clear understanding of the design environments, the power analysis tools, and the design block in the flow. A detailed understanding specific to the designs at Ericsson is essential for working towards the goals of the Thesis.

Understanding the tools and the design runs in parallel with understanding the current verification environment of the design block. The verification framework in place to analyze the performance of the block needs to be adapted and used to generate test cases for power analysis. This requires a good understanding of the set of test cases that will be needed and of the knobs that can tweak the test cases into suitable power analysis tests. This is motivated by the fact that the dynamic power of the block is a function of the use case of the design.

The ability to handle the environment and tools easily, together with an understanding of the verification framework, lays the foundation for obtaining good test cases for power analysis and power bug detection. Once these are in place, power is estimated and analyzed for the design under specific stimuli. An energy-based analysis is formulated to improve the identification of test cases for uncovering power bugs (explained in detail in Section 3.1.3). This is followed by several runs of power analysis, with data collected through exported CSV files and analyzed in Excel. Optimization opportunities identified in the analysis are realized through design modifications in the environment, followed by another power analysis using the tools.

The details of the above methodology are explained in Chapter 3.

1.6 Delimitations

A novel approach to power in designs as developed in this Thesis leads to several pointers towards

using and improving the approach. This Thesis focuses on developing, implementing, and validating this

flow for a design block. It is performed on the memory controller block at Ericsson and is not

attempted on any other blocks for this Thesis. Certain parts of the analysis are manual using the power

analysis tools. The automation of these tasks will involve working with the tool vendors and is not

considered in the Thesis. As part of the validation, the Thesis works on one sub-block and provides
the design team with an extensive worksheet of improvement pointers; it does not validate all of these
improvements within the time frame of the Thesis.

1.7 Structure of the Thesis

Chapter 2 provides a detailed background for the Thesis work and the prerequisites needed to follow
the methodology in Chapter 3. The results of the methodology are presented and analyzed in Chapter 4.
Chapter 5 concludes the report with inferences and pointers to the future of the methodology.

2 Background

This chapter provides the background information needed to follow the methodology of the Thesis.
The following subsections introduce power consumption in digital circuits, the power analysis tools
used, the power reduction techniques, the metrics used for power analysis, the common memory
controller, and the performance analysis framework, each with its relevance to the Thesis. The chapter
also discusses the IP design flow in ASICs, into which the methodology investigated in this Thesis is
eventually to be incorporated to improve the power efficiency of IP designs.

2.1 Power consumption in ASICs

The terminologies associated with power are inconsistent among the available academic sources and

power tools. But, the physics behind these different terms concur. This Thesis work uses

terminologies for power pertaining to Ericsson lingo and relevant to the tools used.

The power components in an ASIC can be broadly classified into two components, namely,

Dynamic power component and Static power component. Dynamic power is the power component

associated with switching of the transistors in a design, which is contributed to by the stimuli at the

input of the design. Static power can be simply defined as the default power consumption of a design

when it is powered on and is idle or inputs are inactive. Understanding these components is critical

in taking steps to reduce these power components, thereby reducing the average power consumption

of the design. These power components in CMOS circuits are discussed in detail in the following sub-

sections.

2.1.1 Dynamic Power

Dynamic power is the component of power that arises out of the switching activity in the transistors

[11]. It corresponds to the power consumed by the device when the signals at the input are changing.

Dynamic power consists of two components.

Switching power – As in Figure 2.1, in any design the switching CMOS circuits have an associated
capacitive load CL. Switching power is the power spent in charging and discharging the capacitance
of the output net during a logic transition. With VIN switching at frequency fSW over a time period
from zero to T seconds, the switching power is the power consumed in charging and discharging the
output capacitor, assuming voltage VDD across the capacitor and current iDD (the components in the
figure are assumed ideal to simplify the formulation).

Figure 2.1: Understanding switching power in CMOS switch

The nodes can switch at a factor of the clock frequency fCLK. Therefore, as a means to realize the

transition rate we introduce the activity factor α, which lies between 0 and 1. This can be used as a

statistical measure of activity across a section of the design. Consequently the switching power can

be formulated as follows[12].

$$\text{Switching Power} = C \cdot V_{DD}^{2} \cdot f_{SW} = \alpha \cdot C \cdot V_{DD}^{2} \cdot f_{CLK}$$

Switching power can be reduced by reducing the overall activity factor of a design. Switching

power is one of the major contributors to total power in designs.
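The switching power relation can be tried out numerically. The following is a minimal Python sketch; the function name and all numeric values (activity factor, capacitance, supply voltage, clock frequency) are hypothetical and chosen only to illustrate how halving the activity factor halves the switching power.

```python
def switching_power(alpha, c_load, v_dd, f_clk):
    """Return switching power in watts: alpha * C * VDD^2 * f_CLK."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("activity factor must lie between 0 and 1")
    return alpha * c_load * v_dd ** 2 * f_clk


# Hypothetical values, not taken from the design analyzed in the Thesis.
baseline = switching_power(alpha=0.2, c_load=1e-12, v_dd=0.8, f_clk=1e9)
reduced = switching_power(alpha=0.1, c_load=1e-12, v_dd=0.8, f_clk=1e9)
print(f"baseline: {baseline * 1e6:.1f} uW, reduced activity: {reduced * 1e6:.1f} uW")
```

With these assumed values the baseline evaluates to about 128 uW, and the reduced-activity case to about 64 uW.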

Internal power (Short circuit power/Crowbar power) – When the transistors in a CMOS circuit

switch, the imperfections in switching durations and the rise and fall durations of the switching inputs

can cause momentary direct current paths between the supply rails. This momentary short circuit

happens on both edge transitions (rising/falling) on the inputs [13]. The crowbar currents are a

function of the relationship between the rise/fall times at the input and output. The power component

is minimized when the two are comparable. A faster transition at the output compared to the input

transition results in higher crowbar current, therefore higher internal power consumption. Internal

power is not a significant concern for well-designed circuits at scaled technology nodes in recent use

because of lower supply rails and threshold voltages.

2.1.2 Static Power

Static power is the power consumption in a circuit's idle state, when there are no signal transitions.

There are several contributing factors to static power, and these are generally modeled into the target

technology library in ASIC designs. Also referred to as leakage power, it is caused by the leakage

currents in CMOS circuits. These leakage currents exist in the powered-on devices even if there is no

switching activity.

Although non-critical at the abstraction level this Thesis operates on, the leakage currents
can be further categorized into components such as the reverse-biased p-n junction leakage

current, gate induced drain leakage current, gate direct tunneling leakage current, punch-through

leakage current and subthreshold leakage current [14]. There could be slight variations in these

parameters based on the states of the circuits and these can be modeled into the technology library

details.

2.1.3 Power Bugs

One of the major goals of the power analysis and reduction flow developed in this thesis is to uncover

‘Power bugs’ in a design. When a functionally correct design shows switching activity when it is not

supposed to toggle, the design is identified to contain power bugs. These can be avoided by disabling

redundant switching in inactive parts of a functionally accurate design. There are several techniques

to solve power bugs that are discussed later in the thesis. Early analysis and detection of power bugs
leads to significant savings in the time and cost of the redesign needed to fix them.

2.2 Software tools used – Introduction and relevance

The power analysis flow developed in the Thesis relies on the power and activity analysis tools for

reporting and investigations for optimization. The tools introduced in this section are used across
different stages of the Thesis and serve purposes which become clearer in section 0. This section

underlines the concepts and features of the tool that are relevant to the Thesis, thereby enabling the

understanding of the reader as the chapters progress.

2.2.1 ActivityExplorer (VCD2RPT++ and VCD2TB)

ActivityExplorer is an in-house tool at Ericsson that is used to perform activity analysis on designs.

The tool analyzes VCD files produced from RTL level and gate-level simulations. It reports the

average, time-based switching activity and clock gating efficiency. These can be visualized at different

hierarchical levels of detail, color-coded based on the metrics and area coded based on size.

The primary step in using ActivityExplorer in the Thesis is to identify suitable test cases for
power analysis. Simulating such a test case generates a VCD (value change dump) file of the interface
inputs. A VCD file is a standardized ASCII-based dump file that captures value changes on variables
in a simulation. The ActivityExplorer tool takes in the generated VCD file and produces a GUI-based
visualization as shown in Figure 2.2. It also provides the activity profile (activity vs. time) of the
design for a selected hierarchical instance. A simplified flow of the usage of the RTL-based
ActivityExplorer (VCD2RPT++) is shown in Figure 2.3.
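To make the idea of a value change dump concrete, the following is a minimal Python sketch that counts value changes per signal in a small VCD text. It is not how ActivityExplorer works internally (that tool is in-house and its implementation is not described here); the sample VCD, the signal names `clk` and `en`, and the function name are hypothetical, and the parser assumes each `$var` declaration sits on a single line.

```python
from collections import Counter

# Tiny hand-written VCD text (hypothetical signals clk and en).
SAMPLE_VCD = """\
$timescale 1ns $end
$scope module top $end
$var wire 1 ! clk $end
$var wire 1 % en $end
$upscope $end
$enddefinitions $end
#0
0!
0%
#5
1!
#10
0!
1%
#15
1!
"""


def count_value_changes(vcd_text):
    """Count value changes per signal in a simple VCD text.

    Handles scalar changes (e.g. 1!) and binary vector changes (e.g. b101 %).
    """
    names = {}           # identifier code -> signal name
    last = {}            # identifier code -> last value seen
    changes = Counter()  # signal name -> number of value changes
    in_body = False
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        head = tok[0]
        if not in_body:
            # Header: $var <type> <width> <id code> <name> ... $end
            if head == "$var" and len(tok) >= 5:
                names[tok[3]] = tok[4]
            elif head == "$enddefinitions":
                in_body = True
            continue
        if head[0] in "01xXzZ" and len(tok) == 1 and len(head) > 1:
            value, code = head[0], head[1:]
        elif head[0] in "bB" and len(tok) == 2:
            value, code = head[1:], tok[1]
        else:
            continue  # timestamps (#...), $dumpvars, real values, ...
        # The first value of a signal is its initial state, not a change.
        if code in last and last[code] != value:
            changes[names.get(code, code)] += 1
        last[code] = value
    return changes


toggles = count_value_changes(SAMPLE_VCD)
print(dict(toggles))
```

For the sample text, `clk` changes three times and `en` once. Dividing such counts by the number of clock cycles in a time window gives a simple activity profile of the kind described above.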

Figure 2.2: ActivityExplorer GUI

VCDs from RTL simulation are used to create a power testbench using the VCD2TB tool, which
replays the test stimulus of the RTL simulation in a netlist simulation. The VCD generated from this
process is the final VCD that provides the gate-level activity, which can be visualized to obtain
gate-level activity and clock gating efficiency plot data, as maps in the GUI and as time profiles.
This ability to visualize the

activity and clock gating efficiency simultaneously for the time slices in the test case is advantageous.

It helps identify primary pointers to inefficiencies when it directly points to bad clock gating

efficiencies for low activity regions of the design for a given time instant in the test case. This pointer

is picked up for one of the cases/designs at Ericsson and the validity of that inference is evaluated

using more focused power analysis. The flow for netlist-level simulation and activity analysis using
VCD2TB and VCD2RPT++ is shown in Figure 2.4.

These tools form the primary steps in choosing test cases for power analysis. Activity analysis is a
fast, preliminary pointer towards the dynamic power and activity of a block during a particular test case. This

helps identify the right test cases for power. Identifying the right test case is of prime importance in

the power analysis flow. The typical goal is to identify low activity test cases to identify redundant

activity and hence dynamic power bugs. These flows can also be used to tweak test cases for specific

characterization purposes of the block based on load. Activity analysis provides the right pointers

towards these tasks in a faster and less resource hogging manner. They form the preliminary step

prior to a resource and time-intensive power analysis using specialized tools.

The test cases and the optimization pointers gathered using the activity analysis tools at RTL and
netlist level are analyzed and validated further using focused power analysis. The VCD2TB and
VCD2RPT++ tools form the basis for the first level of the power analysis flow for the designs

at Ericsson and serve the purpose of fine-tuning and improving the later stages of the analysis taken

up in the Thesis.

Figure 2.3: Simplified Activity Analysis flow

2.2.2 Spyglass Power

Spyglass Power is a power analysis tool from Synopsys. It facilitates early power estimation through

RTL based power and activity analysis for blocks and power exploration. The tool takes in the initial

RTL, a reference netlist, the target technology library, and the activity file (like an FSDB) as the input

for power analysis and exploration. It provides a relatively accurate RTL power estimation and

actionable profiling metrics such as clock gating efficiencies. It helps prepare the RTL for better
inferred clock gating, provides early and fast power numbers with a component-wise split-up, and
supports a metrics-driven power analysis.

Spyglass Power has certain pre-requisites for power analysis. The design under analysis has to
be lint clean with the Spyglass toolset. An accurate analysis necessitates a calibration netlist

reference that enables a good power correlation. It also inputs the target library data and associated

parameters. The analysis needs to have defined power test cases simulated with FSDB files dumped
for them. It requires the FSDB dumping tool to be compatible with the version of the Spyglass Power
analysis tool.

Figure 2.4: Gate level replay simulation and activity analysis flow

Once the pre-requisites for the analysis are in place, the initial steps preparing for the analysis

need attention. The power test cases defined are simulated and FSDBs with correct versions are

dumped. The design of the block under analysis needs to be specified. The analysis is pointed to the

right FSDB, with the time window for which the analysis is expected. Spyglass Power analysis happens

as a sequence of goals. These goals need to be set as a preparatory step for the analysis (Figure 2.5).

The analysis also needs to be pointed to the calibration netlist and clock gating thresholds can be set

to enable different levels of clock gating.

Once the preparatory steps are done, the power estimation as a set of goals is represented in

Figure 2.5. The flow takes in a reference SPEF (Standard Parasitic Exchange format) file, a reference

netlist, the library files, switching activity file obtained from simulations (FSDB files), and certain

power parameters for accurate estimation. The first goal is the design read, which reads the design

block under analysis, which is specified beforehand. Then the power audit goal is executed, where an
audit is performed to check the design, simulation data, and technology library for consistency and to list the

key parameters in the power estimation. Then the vector analysis goal is performed where the activity

is analyzed for a simulation testbench and an activity profile analyzed over time is generated. Then

finally, the power estimation and profiling goal is run where the estimated power, activity, and

efficiency information for clock, registers, and memories are computed for the time intervals of

interest. It also points to inefficient clock gating and opportunities to uncover power bugs.

Spyglass Power analysis results are reported as follows, further detailing the categories explained
in Section 2.1 (dynamic and static power):

Figure 2.5: Spyglass Power analysis flow - goals and steps

1. Combinational power - It is the power consumed by a combinational cell and a net driven by

a combinational cell. It is a dominant component in Datapath intensive designs and is directly

proportional to high data toggle and large combinational logic. There are various techniques

to control this power component if issues are identified, such as reducing combinational

depths, registering inputs to combinational logic, and data gating.

2. Sequential power - It is the power consumed by the sequential cells in a design (registers and

latches) and the output nets of sequential logic.

3. Clock power - It is the power consumed by the clock network. It is one of the major consumers

of dynamic power.

4. Memory power - It is the power consumed by the memory (based on the library file) and the

output nets of the memory cell. Memory power is proportional to the number of read/write

operations on it. The optimization techniques to control this are the ones to minimize

redundant read/write operations. Memory leakage (default power consumed during power-

on) is a significant contributor to this component and necessitates various operational modes

like sleep, deep sleep to manage the leakage power.

The other power components that Spyglass reports are IO power (power consumed by the IO pads

based on technology library), mega cell, and black box powers (special blocks specified in the

configuration or SGDC constraint files). But these are not of focus for this Thesis.

The tool uses a reference netlist and technology library to pseudo synthesize the design RTL and

estimate power. This estimation relies on two statistical parameters, activity and probability.
According to the Spyglass manual [19], activity is defined as the number of toggles per clock cycle on
the signal, averaged across many clock cycles. Probability is the percentage of time that a signal is high.

These statistical parameters are used by the tool as the basis for power estimation. This provides a

reliable starting point for power estimation. Spyglass introduces virtual cells for clock tree modeling.

The pseudo netlist with simulation and parasitic data allows spyglass to calculate the contribution of

static and dynamic power to the total power.
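The two statistical parameters defined above can be illustrated with a short Python sketch. This is only an illustration of the definitions, not Spyglass's actual computation: it assumes the signal is sampled once per clock cycle (so glitches within a cycle are not seen), and the function name and sample waveform are hypothetical.

```python
def activity_and_probability(samples):
    """Compute (activity, probability) for a signal sampled once per clock cycle.

    activity    = toggles per clock cycle, averaged over the window
    probability = fraction of samples in which the signal is high
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    toggles = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    activity = toggles / (len(samples) - 1)  # one interval between consecutive samples
    probability = sum(samples) / len(samples)
    return activity, probability


# Hypothetical waveform: 9 samples, i.e. 8 clock intervals.
wave = [0, 1, 1, 0, 0, 0, 1, 1, 1]
activity, probability = activity_and_probability(wave)
print(f"activity = {activity:.3f} toggles/cycle, probability = {probability:.3f}")
```

For this waveform there are 3 toggles over 8 intervals (activity 0.375), and the signal is high in 5 of 9 samples.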

The categorization of power discussed in Section 2.1 can be defined in the context of the Spyglass

Power tool as follows.

1. Leakage power – The leakage power of any cell is specified in the technology library file

corresponding to it. In the Spyglass context, total leakage power is the sum of the leakage of

all the cells present in the design. In the spyglass generated reports, the leakage is broken

down into its contributors such as combinational, sequential, or memories. The leakage

power calculations in Spyglass depend on the type of cell instantiated in the reference
netlist, activity data if the library has state-dependent leakage values, and the declaration of any
power domains.

2. Internal power - It corresponds to the power dissipated within the boundary of a cell when a

state transition occurs. The internal power calculation depends on the library file which

annotates energy data for transitions based on slew rate and output load. Spyglass utilizes the
activity information from the simulation data to estimate how often the cells toggle and uses
the technology library data to derive the power numbers. Spyglass estimations of internal
power depend on the types of cell instantiated in the reference netlist, activity data (FSDB
files from simulation), wire parasitics, and slew values.

3. Switching power – As discussed in section 2.1.1, switching power can be expressed as a

function of the operating voltage, output capacitance, and the switching frequency (or a factor

of clock frequency). Spyglass computes the capacitance from the contribution of the cell pin

(using the library file), the contribution from the wire capacitance model (using the SPEF file,

or the wire load model). Switching power estimation uses these data in cohesion with the
toggling activity of the net (which is derived from the simulation data). Therefore, the
switching power depends on the type of cell instantiated in the reference netlist, activity data,
and wire parasitics.

2.2.3 PrimeTimePX

PrimeTimePX is a sign-off power analysis tool from Synopsys used later in the IP design flow for

accurate power estimation closest to silicon. It builds a detailed power profile of the design based on

the circuit connectivity, the switching activity, net capacitance, and the power behavior data from the

technology library. It calculates the power behavior for a circuit at the cell level and reports the power

consumption at the chip, block, or cell levels. [20]

Power analysis using PrimeTimePX is implemented using a Tcl script that is specified in the

work directory. This Tcl file consists of a sequence of steps as described in Figure 2.6. The first step

is to specify analysis mode, which could be averaged or time-based power analysis mode.

The input files needed for such an analysis are discussed in detail as follows.

• Gate-level Netlist – PrimeTimePX takes in a gate-level pre-layout netlist (generally a Verilog

file). The netlist contains leaf-level cells instantiated from the library cells. A flat or

hierarchical netlist can be used for such an analysis.

• Technology Library – The technology library file consists of the library cells, with each cell
carrying timing, power, and characterization information, such as power numbers per
cell.

• SDC file – This file specifies the design constraints. It specifies the constraints on all ports and
pins, I/O paths of a design, and clock.

Figure 2.6: PrimeTimePX Analysis Flow

• Parasitic File – This file contains the capacitance of the nets. It is one of the factors in

determining dynamic power.

• Switching Activity – In the averaged power analysis, a SAIF or VCD file is used to read the

switching activity. These are created from RTL or Gate level simulation.

These files are specified in the power analysis setup and Tcl files prior to the analysis. The analysis
produces a set of reports containing all the relevant power numbers and metrics for the design being
analyzed. These reports are generated in the work directory. Note that the power analysis performed
with these tools in this Thesis does not directly correlate to silicon. The metrics from this analysis
are used for the improvement of the block and for a system analysis in Excel that is used to correlate
to silicon and ESL (Electronic System Level) power analysis, which is a closer estimate to silicon.

2.3 Dynamic Power reduction and Clock gating

The push towards low power consumption in digital ASIC designs has led to several techniques used

to reduce power in designs. There are several of these techniques focusing on static and dynamic

components of power. The major generic factors that motivate this push towards low power are

battery life elongation, carbon footprint reduction, hot-spot avoidance in devices/reduced cooling

facilities, and longer component life. This Thesis focusses on dynamic power reduction and this

section details the methods, techniques, and metrics used.

Clock power is one of the major contributors to the overall dynamic power in ASIC designs. With
advancing technology nodes and a shrinking share of the static power component, clock power and
the associated switching power become points of significant concern. The improvement of static

power components with innovative transistor designs using FinFETs provides the opportunity to

focus primarily on the Dynamic power consumption of a design. Dynamic power component is very

susceptible to power bugs based on usage scenarios. This is due to redundant switching activity that

can arise in certain usage scenarios due to non-visibility of these scenarios in the functional

verification of these design blocks. This provides the scope to introduce a power analysis and

reduction flow in the IP design flow that can improve the design, in a manner similar to functional

verification and code coverage. The development of such a power reduction flow would involve using

power reduction techniques on the design.

Figure 2.7: Widely used techniques for Power reduction and control [14]

Several power-reduction techniques have been devised over the years for the reduction of static

and dynamic power components. Figure 2.7 shows a listing of the techniques as discussed in detail in

the reference [14].

There is a crucial correlation between how early an effort is made for power reduction, the power

savings achieved, and the accuracy error as depicted in Figure 2.8 referencing [14]. This motivates

early power analysis and reduction.

The methodology introduced in this Thesis focuses on the introduction of early power analysis

and reduction in the IP design flow. Analysis at the RTL design stage is a good place to start in terms

of the level of impact and accuracy. Power estimation at RTL driven by the performance analysis

framework helps pinpoint test cases that help identify power bugs and therefore in power reduction.

2.3.1 Clock Gating

Once the power bugs are detected, the power reduction technique focused on in the Thesis for

optimization is Clock gating. Clock gating is one of the critical techniques to address the need for a

reduction in dynamic power. It is one of the most widely taken approaches when trying to address power

reduction problems.

The simplistic idea behind clock gating is that, in a register, when there is no activity recorded in

the data input, there is no need to clock the registers during that period. This provides the opportunity

to switch off/disable clock transitions during this scenario. It is common in designs to have several /

a bank of registers driven by a single clock line. In these cases, an enable signal is introduced to

gate/disable the clocking of registers. This signal can be labeled as the clock gating enable.

The pseudo-code snippet below represents the implementation of clock gating.

    wait until clk'event and clk = '1';
    if en = '1' then q <= d; end if;

The above RTL specifies a clock gating enable ‘en’. The synthesis tool interprets this enable as

one of the two implementations as shown in Figure 2.9.

Figure 2.8: Power savings and accuracy attainable at different levels of abstraction [14]

The first implementation is a “re-circulating register” implementation where the enable is used

to select between new data or re-circulating the previous data value. The second implementation

involves gating the clock where, when the enable is off, the clock is disabled [15]. The two

implementations are functionally equivalent but differ in timing and power behavior. An integrated

clock gating cell implementation, in the second implementation, helps better in power saving by using

the clock shutdown mechanism.
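The difference between the two implementations can be sketched with a cycle-level Python model. This is a behavioral illustration only, under the assumption that the register starts at 0 and that `d` and `en` are sampled once per clock cycle; the function name and the stimulus are hypothetical. It shows that both implementations produce the same output while the gated version presents fewer clock edges to the flop.

```python
def simulate(d_values, en_values, gated):
    """Cycle-level model of one register with an enable.

    gated=False: re-circulating register; the flop is clocked every cycle and
                 a mux feeds back q when en is 0.
    gated=True:  gated clock; the flop only sees a clock edge when en is 1.
    Returns (q_trace, clock_edges_seen_by_flop).
    """
    q, q_trace, edges = 0, [], 0
    for d, en in zip(d_values, en_values):
        if gated:
            if en:
                edges += 1
                q = d
        else:
            edges += 1
            q = d if en else q
        q_trace.append(q)
    return q_trace, edges


# Hypothetical stimulus over 8 clock cycles.
d = [1, 1, 0, 0, 1, 0, 1, 1]
en = [1, 0, 0, 1, 0, 0, 1, 0]
q_mux, edges_mux = simulate(d, en, gated=False)
q_cg, edges_cg = simulate(d, en, gated=True)
print(q_mux == q_cg, edges_mux, edges_cg)
```

In this example the outputs match cycle by cycle, while the flop is clocked 8 times in the re-circulating version and 3 times in the gated version.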

Therefore, improvement of clock gating metrics by identifying new clock gating opportunities is

a way to improve the power performance of the design. The goal is to introduce gating cells and

formulate good gating enables for these cells. It is critical to ensure that the enables are designed in

such a way that the clock gating opportunities utilized save power rather than increase it. For example,

in a case where the clock is always enabled, the insertion of clock gates leads to additional enable logic

and will consume more power. A clock gate added to a design also introduces delay in the clock tree
and makes clock tree synthesis tougher. This necessitates a differential power computation to ensure
that gating does introduce power savings. Clock gating implementations also need to be verified for
their impact on the testability of designs. With designs involving state machines, idle states can be

identified for clock gating of certain sections of the IP adaptively. In IPs designed with multiple clock

domains, idle clock domains can be identified and gated.
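The differential power computation mentioned above can be sketched as a simple first-order estimate. The model below is an assumption made for illustration, not the computation performed by any of the tools: it treats the saving as the ungated clock power of the register bank scaled by the fraction of suppressed clock toggles, minus a fixed overhead for the gating cell and its enable logic. All names and numbers are hypothetical.

```python
def net_gating_saving(clock_power_ungated, gated_fraction, gate_overhead):
    """First-order estimate of the net power saved by inserting a clock gate.

    clock_power_ungated: clock power of the register bank without gating
    gated_fraction:      fraction of clock toggles the gate suppresses (0..1)
    gate_overhead:       power of the gating cell plus its enable logic
    All powers are in the same (arbitrary) unit.
    """
    if not 0.0 <= gated_fraction <= 1.0:
        raise ValueError("gated_fraction must lie between 0 and 1")
    return clock_power_ungated * gated_fraction - gate_overhead


# Enable always on: nothing is suppressed, so the gate only adds overhead.
always_enabled = net_gating_saving(100.0, 0.0, 5.0)
# Enable off 60% of the time: gating pays off.
mostly_idle = net_gating_saving(100.0, 0.6, 5.0)
print(always_enabled, mostly_idle)
```

A negative result, as in the always-enabled case, indicates that the gate would increase power rather than save it.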

2.3.2 Clock gating performance metrics

A power analysis framework implemented in the Thesis that connects to a performance analysis

framework and points to scopes for improvement of dynamic power needs a few well-defined dynamic

performance metrics to work on. While the relevance of these metrics to the complete flow will be

discussed in Section 3, this section introduces the clock gating metrics. These are ratios that provide

an indication of how effective clock gating is in the design being analyzed. These metrics are derived

to be used with the Spyglass Power analysis tool used in this Thesis. These are practical metrics that

are available in the tools to evaluate. The power analysis flow developed in conjunction with the tool
and the metrics leads to early pointers for power bug detection. These metrics are accurate early on,
from the analysis of RTL designs. This enables early power analysis and power bug detection.

Figure 2.9: Possible synthesis of clock gating

2.3.2.1 Static clock gating efficiency

Static clock gating efficiency (SCGE), also known as the clock gating ratio is a structural metric.

It can be defined as the percentage ratio of clock gated registers with respect to the total number of

registers. The ratio can be represented as follows.

$$SCGE = \frac{\text{Number of clock gated registers}}{\text{Total number of registers}}$$

This metric can be understood as the percentage of registers in the analyzed design that are

enabled with clock gating. This provides an idea of what percentage of registers in a design hierarchy

can be clock gated for performance improvement. Although low SCGE is a direct indicator of low

scope for clock gating, this metric is best used in combination with the other clock gating metrics that

shall be discussed.

Figure 2.10 shows a sample register tree for which the Static clock gating efficiency is computed.

In the example (Figure 2.10), three of the five registers have an inferred clock gate. This leads to

a static clock gating efficiency of (3/5), that is 60%.
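The SCGE computation for this example can be written as a few lines of Python. The list below simply encodes the five registers of the example, three of them clock gated; the function and variable names are hypothetical.

```python
def scge(register_is_gated):
    """Static clock gating efficiency in percent: gated registers / all registers."""
    if not register_is_gated:
        raise ValueError("design has no registers")
    return 100.0 * sum(register_is_gated) / len(register_is_gated)


# Five registers, three of which have an inferred clock gate.
registers = [True, True, True, False, False]
print(f"SCGE = {scge(registers):.0f}%")
```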

Figure 2.10: Understanding SCGE

2.3.2.2 Dynamic clock gating efficiency

Dynamic clock gating efficiency (DCGE) is an activity-based gating performance metric. Also

referred to as simply the Clock gating efficiency, because of its significance as the efficiency metric,

DCGE can be simply defined as the percentage of time a clock is gated. This can be represented as

follows.

$$DCGE = \frac{\text{Gated or suppressed clock toggles}}{\text{All toggles on the clock pin}} = 1 - \frac{\text{Active clock toggles after clock gate}}{\text{All toggles on the clock pin}}$$

DCGE is a measure of how effectively the instantiated clock gates suppress the clock going to the

registers. This can be understood using the example that follows.

In Figure 2.11, the average dynamic clock gating efficiency can be calculated as the mean of the

DCGE on the three clock gated registers. Hence the average DCGE is calculated as,

$$\text{Avg. } DCGE = \frac{1}{3} \cdot \left(\frac{4}{8} + \frac{5}{8} + \frac{6}{8}\right) = \frac{5}{8} = 62.5\%$$

This leads to an average DCGE of 62.5% for the sample design.
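The same average can be reproduced with a short Python sketch. It assumes, following the fractions in the calculation above, that each of the three gated registers sees 8 clock toggles at the gate input, of which 4, 5, and 6 respectively are suppressed; the function name is hypothetical.

```python
def dcge(active_toggles, total_toggles):
    """Dynamic clock gating efficiency: 1 - active toggles / all toggles."""
    if total_toggles <= 0:
        raise ValueError("total_toggles must be positive")
    return 1.0 - active_toggles / total_toggles


# (active toggles after the gate, all toggles on the clock pin) per register;
# 8 - 4, 8 - 5 and 8 - 6 toggles remain active after gating.
registers = [(4, 8), (3, 8), (2, 8)]
per_register = [dcge(active, total) for active, total in registers]
average = sum(per_register) / len(per_register)
print(f"average DCGE = {average * 100:.1f}%")
```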

Since DCGE is a measure of how effectively the instantiated clock gates suppress the clock, a higher
DCGE for a design is good. But as a metric, although higher DCGE means more efficient clock gating,
it doesn't take into consideration the data line transitions that need to be clocked when DCGE
numbers are low. Thus, a lower DCGE does not necessarily mean bad clock gating, as there is the
possibility that the majority of clock cycles cannot be suppressed due to transitions on the data
line that need to be clocked in those cycles. This necessitates other metrics that take the data
transitions in a design into consideration as well. This leads to the efficiency metrics discussed in
the following sections.

Figure 2.11: Understanding DCGE

2.3.2.3 Register Output Activity Density for Flops

Register output activity density for flops (ROADF) is a register level metric that depends on the

level of activity on the data line. It is a measure of how effectively or adequately data transitions are

clocked in a register that has a gated clock. It looks for redundant clock transitions on the clock pin

when there are no data transitions on the data line. This metric can be formulated as follows.

$$ROADF = \frac{\text{Number of toggles on the Data, Q pin}}{\text{Number of active transitions on the clock pin}}$$

This metric can be better understood from the example below.

In Figure 2.12, using the transition waveforms of the clock, the clock gating enable, and the data lines, we can arrive at the ROADF and DCGE from the above definitions as follows.

$$\text{ROADF} = \frac{2}{3} \approx 66.7\%$$

$$\text{DCGE} = \frac{5}{8} = 62.5\%$$

But from the waveforms we see that the clock gating enable leaves more clock cycles ungated than necessary for clocking the 2 data transitions during the period. This provides scope for improving the ROADF metric to 100%, which corresponds to optimal clock gating that still clocks the data appropriately. This leads us to the improved gating enable waveform, which gives a ROADF as follows.

$$\text{ROADF} = \frac{2}{2} = 100\%$$

Figure 2.12: Understanding ROADF

A 100% ROADF is an indication of optimal clock gating for that register. If the clock were gated using the improved enable, the extra enabled clock cycle would be removed, leading to a dynamic clock gating efficiency computed as follows.

$$\text{DCGE} = \frac{6}{8} = 75\%$$

This turns out to be the highest achievable DCGE for the given number of transitions on the data line, which corresponds to optimal clock gating. This clarifies the ROADF metric’s importance as a validation of how close a register is to the highest achievable DCGE. The SCGE and DCGE metrics do not provide a clear picture of the clock gating performance when looked at in isolation. These metrics in combination with the ROADF metric lead to a better analysis, as developed in this Thesis.
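The two metrics can be cross-checked with a few lines of Python. The following is a minimal sketch and not the tooling used in this Thesis: the per-cycle lists `enable` and `q_toggle` are hypothetical waveforms chosen to reproduce the numbers of the example above, and DCGE is assumed to be the fraction of clock cycles in which the clock is suppressed, which is consistent with the worked values.

```python
def dcge(enable):
    """Fraction of clock cycles in which the clock is gated (enable == 0)."""
    return enable.count(0) / len(enable)


def roadf(enable, q_toggle):
    """Toggles on the Q pin divided by active (ungated) clock cycles."""
    active_clocks = sum(enable)
    if active_clocks == 0:
        return None  # register never clocked, ratio undefined
    return sum(q_toggle) / active_clocks


# Hypothetical 8-cycle window: 1 = clock passes / Q toggles in that cycle.
enable = [1, 0, 0, 1, 0, 1, 0, 0]      # 3 ungated cycles
q_toggle = [1, 0, 0, 0, 0, 1, 0, 0]    # 2 data transitions

# Improved enable: ungate only the cycles that actually clock new data.
improved_enable = list(q_toggle)

print(dcge(enable), roadf(enable, q_toggle))
print(dcge(improved_enable), roadf(improved_enable, q_toggle))
```

With these assumed waveforms the original enable gives a DCGE of 5/8 and a ROADF of 2/3, while the improved enable gives 6/8 and 2/2.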

2.3.2.4 Register Output Activity Density for Enables

Register Output Activity Density for Enables (ROADE) is an extension of the ROADF metric discussed in the previous section to a data path of registers. When a bank of registers, typically sharing a similar functionality or data path, is clock gated by a common enable, ROADE provides a measure of how efficient the clock gating is, considering the data transitions on all the data lines in the data path. As an extension of the ROADF metric to a bank of multiple registers, ROADE can be formulated as follows.

$$\text{ROADE} = \frac{\text{Number of toggles on all data lines of the register bank}}{\text{Number of active toggles on the clock pin to the register bank}}$$

It helps identify inefficient clock gating enables that clock the registers even when there are no

data transitions across the data path of the register banks. Figure 2.13 helps clarify the idea with

examples. The figure shows a bank of registers connected with a common clock and a common clock

gating enable. ROADE is a metric that is relevant to multi-register data paths for which clock gating

is enabled.

From this example we see that R1 and R2, driven by the common clock, also share the common clock gating enable signal. As we saw in the previous sub-section, the ROADF is 66.66%. According to the definition of ROADE, we see that,

$$\text{ROADE} = \frac{3}{3} = 100\%$$

Although at the register level we see that the clock gating is not optimal, at the register data path level we see full utilization of the clock transitions in clocking the data transitions across all data paths. This makes ROADE a more reliable metric for multi-register data paths. Since the enable is common to these registers, it becomes irrelevant to consider the ROADF metric for this bank of registers. A high ROADE value is necessary for the clock gating performance to be optimal. It becomes clear that the highest achievable DCGE is reached as ROADE approaches 100%.
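The bank-level calculation can be sketched in the same way. This is an illustrative example under stated assumptions: the waveforms for `R1` and `R2` are hypothetical (they are not read from Figure 2.13), and the ROADE formula is applied literally, summing the toggles on all data lines and dividing by the active clock cycles of the shared gated clock.

```python
def roade(enable, bank_toggles):
    """Toggles on all data lines of the bank / active cycles of the shared clock."""
    active_clocks = sum(enable)
    if active_clocks == 0:
        return None  # bank never clocked, ratio undefined
    data_toggles = sum(sum(toggles) for toggles in bank_toggles.values())
    return data_toggles / active_clocks


# Hypothetical 8-cycle window with one enable shared by both registers.
enable = [1, 0, 0, 1, 0, 1, 0, 0]
bank_toggles = {
    "R1": [1, 0, 0, 0, 0, 1, 0, 0],  # 2 toggles: per-register ROADF of 2/3
    "R2": [0, 0, 0, 1, 0, 0, 0, 0],  # 1 toggle in the cycle R1 leaves unused
}

print(roade(enable, bank_toggles))
```

Note that with this literal reading the ratio can exceed 100% when several registers toggle in the same cycle; whether an analysis tool counts toggles or cycles with at least one toggle is not specified here.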

ROADE, DCGE, and SCGE are the major metrics carried forward and used as the goal metrics in the power analysis framework developed and implemented in the Thesis. A combined view of these indicates the current quality of clock gating and helps point out the way ahead for improvement.

2.4 Framework development environment

This section introduces the tools and methodologies that currently assist ASIC design and motivates how the power analysis framework developed in the Thesis can be placed into the existing workflow. The section is split into sub-sections that introduce the Common Memory Controller (CMC), the IP design flow, and the existing verification environment for the CMC, to which the power analysis framework connects for appropriate stimulation for power analysis. An understanding of these dependencies of the framework development provides an overview of all considerations that might need to be taken into account to make it effective.

Figure 2.13: Understanding ROADE

2.4.1 Common Memory Controller

The Thesis focuses on developing a power analysis flow for the Common Memory Controller (CMC)

subsystem IP, at the ASIC team in Ericsson. It is a persistent part of the ASIC designs and one of the

major consumers of power in the designs. This motivates the development of a power analysis flow

focusing on CMC although the goal is to be able to develop the framework that can eventually be

custom fit to be performed on other design blocks as well.

CMC is a block that enables the sharing of memory resources between other blocks such as DSPs, accelerators, or other interface blocks. The design is hierarchical and enables a top-down approach when analyzing power and related metrics. The CMC consists of several sub-blocks, and the goal of the Thesis is to characterize the block using power numbers and other metrics and to look for potential power bugs. This approach needs a sufficient understanding of the sub-blocks and the right stimulus to generate the varying levels of block utilization for power analysis. Awareness of the sub-block functionality also helps confirm improvement areas and the feasibility of actually making improvements.

Simplistically, for this report focusing on power analysis, the CMC can be viewed as a subsystem that provides DSPs and other client-like systems read/write access to memory areas. Characterization of the block involves varying the intensity, length, and payload of the read/write operations to measure, monitor, tabulate, and plot the critical parameters that help understand the design operation and extrapolate these for future designs. The scope for dynamic power improvement is identified by finding redundant switching in the design, especially during low-activity operation. This Thesis utilizes the performance analysis framework to create stimuli suitable for power analysis, focusing on characterizing and optimizing the block.

2.4.2 Performance Analysis framework

The development of a power analysis flow uses the performance verification framework developed for

Common Memory Controller (CMC). This is a UVM based verification environment, where different

test cases are simulated with varying parameters to characterize the performance of CMC for different

software loads. It also has options to stochastically load the block using variation of parameters. Some

of these parameters are connected to the power analysis flow developed. These parameters are

tweaked to generate the right test cases that shall be utilized for the two purposes of the power analysis

flow, namely, characterization and optimization of CMC. The exact knobs that are used for this and

the motivation towards generating these test cases are discussed in section 3.1.

It is critical to perform power analysis with the right stimulus applied to the design under analysis. This provides the right scenarios to create a load-based power profile and also to uncover power bugs. Characterizing a block means identifying the various loading scenarios that the design is subjected to and analyzing the block for power by generating the stimulus that pushes the block into each of these operating scenarios. Power bug detection is best enabled by looking at low load scenarios. Low load scenarios lead to lower switching activity and thus help uncover redundant switching activity in design areas that, according to the stimuli, were supposed to be inactive. Such control over the stimulus is achievable by connecting the power analysis flow to the performance verification framework for its stimuli. The Thesis motivates this logical connection between the two and modifies the verification framework to serve as the enabler for the development and implementation of the power analysis flow.

2.4.3 IP Design flow – Introducing a Power analysis flow

The development of an ASIC chip involves several steps that have been modeled over the years as a design flow. This process starts with a requirement or concept. Based on this requirement, architectural specifications are framed. Multiple iterations of RTL coding then follow, along with multiple iterations of RTL simulation and verification. The RTL simulation is followed by logic synthesis and optimization. Another important step during synthesis is static timing analysis (STA). STA is performed to check and ensure that all the functional requirements of the design are achieved with timing closure. In case of failure to achieve timing closure (insufficient slack), the optimization and logic synthesis might need revisiting to improve the timing performance of the design. Once the pre-layout static timing analysis is passed, the design proceeds closer towards silicon through floor planning, placement, and clock tree insertion. This is validated with another static timing analysis before finalizing the routing. Finally, the routed design is taped out. This simplified IP design flow is visualized using the flowchart in Figure 2.14. [16]

Figure 2.14: A generic ASIC IP Design methodology [16]

Now when we move this procedural flow on to implementation, we focus on the checkpoints of

the design flow. Figure 2.15 introduces the idea of where power analysis fits in the design flow. The

power analysis flow as discussed in the next section can be started when the RTL code is written, and

a verification environment is set up for the functional verification of RTL. This can enable RTL power

analysis and gate-level power analysis.

The parallelization is shown during the simulation, verification, and synthesis phases, when the RTL is available for the design. The yellow boxes connect the power-based RTL iteration to the RTL design stage. The dark green boxes connect the power analysis outcome to the flow as a checkpoint. The sub-flow in the light green box represents a power analysis-based optimization that can be considered similar to code coverage analysis. A power analysis flow that fits into the IP design flow in this way creates a cohesive verification environment for functional and power optimization of the design. Such a framework, which adds the power perspective that is missing in the standard IP design flow, is what this Thesis sets out to develop.

Figure 2.15: Early Power perspective to IP design flow

3 Power analysis and reduction framework development methodology

The power analysis and reduction flow, whose prerequisites were laid out in the previous sections, was implemented in several steps that spanned the duration of the Thesis. This section explains the methodology.

Although the flow implementation is associated very closely with the Common Memory Controller (CMC), the verification framework for CMC, and specific power analysis tools, the methodology presented is intended to provide a core concept that can be generalized and used for the characterization and optimization (activity-based reduction of redundant dynamic power) of any ASIC design block, using any environment or tools. Such a flow is feasible for any design that has its RTL code written and a verification environment set up. Based on the use-cases of the block, the performance validation framework can be tweaked to generate stimulus (focused test cases) for power analysis and power reduction. This connection to the verification framework of an ASIC IP block, as a source of power analysis stimulus, forms the first part of the power analysis framework developed in this Thesis. This is followed by using the stimulus and the design to characterize the design and point to power bugs in it. Based on this step, we optimize the design through RTL modification and repeat the power analysis steps to validate the flow.

So, the subsequent sub-sections detail the steps followed in the methodology that helps develop

the power analysis and reduction framework.

3.1 Power Test cases

Power test cases need to be clearly defined before power analysis. Time and effort need to be spent on specifying test cases for which the power analyses make sense. Power analysis cannot be generalized for an ASIC IP block; its results are relevant only when associated with specific use-scenarios or test cases on the block. For example, consider a microprocessor IC design for which the power analysis is done by booting Linux, while the actual use-case of the IC is a certain end application running on Linux. The loading in these two scenarios differs, and it does not make sense to use a Linux boot power test case to characterize the design operation in a different use-case.

The performance analysis framework (described in section 2.4.2) in place for CMC functional verification is used to create the power test cases. As discussed in section 2.4.1, for the sake of the development of this power analysis flow, CMC can be viewed as a subsystem that provides DSPs and other client-like systems read/write access to memory areas. This leads us to the need for specifying the parameters or knobs that help characterize test cases. The first step in identifying knobs is to understand the block’s typical use-case and its corresponding activity profile. An understanding of the test case that can simulate this use-case helps arrive at the knobs to create the power test cases. Figure 3.1 indicates the type of activity curve obtained by simulating the power test cases. The use-case in this profile translates to a peak for buffer initialization at the beginning of the test case, followed by a series of read/write operations of fixed length and intensity, which causes the roughly static part of the curve. This is followed by the end of the test case.

The subsequent sections clarify these knobs, the motivation to use those, the methods to choose

test cases for the two purposes of the power analysis flow, and a summary of the power analysis test

cases that shall be used henceforth in the methodology descriptions.

3.1.1 Power test case/Stimuli knobs

Power test cases for the Common Memory Controller (CMC) are developed from the verification framework using a set of parameter knobs which are varied to create different usage scenarios for power. As an initial step to understanding these knobs and the test cases, the CMC is simulated while varying these parameters across all values within the functional boundaries. Simulations across the intervals of interest of these parameters help create a map of the test cases that comes in handy when clustering them as test cases for power characterization and power optimization. This section introduces these parameter knobs and motivates their role in the power analysis framework.

3.1.1.1 Number of buffers allocated in test case

This knob controls the number of buffers allocated to enable the memory reads/writes from the Common Memory Controller (CMC). The buffers in the test case help execute and randomize the accesses. A higher number of buffers increases the buffer initialization period, which is the initial spike seen in the sample activity profile in Figure 3.1. Although this knob has an impact on the overall energy of the test case, it does not affect the instantaneous value of power during the required read/write operations. This knob was varied and tested across the possible intervals. This variation was analyzed using the VCD2RPT++ ActivityExplorer tool, and activity profiles were generated as a map for the knob values. The number of buffers allocated can also be used to minimize randomization and increase the predictability of memory access for known access sizes. This feature of the knob became important when computing the cost of access, where efforts were needed to maximize the predictability of access while analyzing the power cost per access.

3.1.1.2 Number of accesses per test

The number of accesses knob fixes the number of accesses that happen in a test case for a client. The higher the number of accesses, the longer the read/write access durations, and hence the longer the test duration. This is another knob that impacts the energy of the test case but has no effect on the instantaneous power of the test. In the activity profile in Figure 3.1, the knob affects the roughly stable duration of activity where the read/write accesses take place after the buffer initialization. This knob’s direct relation to test duration is used to create longer or shorter tests based on requirements. Otherwise, it is kept constant for tests that vary other knobs that impact the instantaneous power. The number of accesses per test was varied across the possible values to study the effect of this knob. The VCD2RPT++ ActivityExplorer was used to analyze the activity.

Figure 3.1: Activity Profile of a typical use-case for the block

3.1.1.3 Number of used ports/clients

The number of ports is an important knob that directly relates to power. It can be understood as the number of clients (like DSPs or accelerators) that perform read/write transactions through the Common Memory Controller (CMC). A higher number of ports corresponds to higher activity of the block and therefore higher power. The number of ports is varied across the simulations, and activity analysis is performed using VCD2RPT++. This knob also helps ensure that the power consumed by the block scales with the loading of the block. If that is not the case, it shows that the system consumes power due to power bugs caused by redundant activity that is independent of the load. This knob is also used to calculate the cost of access as a function of the number of client accesses. These are valuable analyses for the characterization and optimization of CMC.
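One simple way to check whether power scales with load is to fit a straight line to power versus number of active ports. The sketch below is an assumption-laden illustration rather than the analysis performed in this Thesis: the power values are invented, a linear model is assumed, and the function name `fit_line` is hypothetical. The slope can be read as an incremental cost per client, and the intercept as the load-independent part, which lumps together leakage, clock tree power, and any redundant activity.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical average power (mW) measured with 1 to 4 active ports.
ports = [1, 2, 3, 4]
power_mw = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(ports, power_mw)
print(f"incremental cost per port: {slope:.2f} mW")
print(f"load-independent power:    {intercept:.2f} mW")
```

A large intercept relative to the slope would be a hint, not a proof, that a significant share of the power does not scale with load and deserves a closer look.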

3.1.1.4 Intensity of transactions

The roughly stable part of the activity profile in Figure 3.1 translates to the read/write transactions happening in the test case. The intensity of the transactions is set by the delay between the execution of two transactions: the shorter the delay, the higher the intensity. This is defined in the test as the delay between the start-of-transaction (SOT) of two read/write accesses. The higher the intensity of transactions, the higher the activity and therefore the higher the power consumption. If we consider the complete test case execution as a job on the Common Memory Controller (CMC), intensity as a knob helps control the intra-job delay between the transactions that make up the job. To understand and characterize the knob, the values of intensity were varied across the functionally viable values, and the activity of CMC was analyzed across the test case using the VCD2RPT++ activity analysis tool. This knob can be used to load the design to different levels to characterize the block and also to introduce intra-job delays as a means to uncover power bugs, as discussed in section 3.1.3.

3.1.1.5 Port access exclusivity

Common Memory Controller (CMC) consists of multiple interconnect blocks through which the clients perform memory access. They are used to logically or functionally partition and categorize memory accesses. The port access exclusivity knob was introduced to enable selective and exclusive access to ports and the interconnect blocks that the ports are a part of. This knob enables characterization by allowing individual accesses for the power cost calculation of incremental read/write accesses. It enables optimization by enabling interconnect blocks selectively and identifying redundant switching in unused interconnect blocks. This is another knob achieved by modifying the verification framework, after opportunities for characterization and optimization were identified through brainstorming and discussions. The port exclusivity knob was tested with all combinations of interconnect ports on CMC and analyzed through visualization in the VCD2RPT++ activity analysis tool. Figure 3.2 is a screenshot from the activity analysis tool depicting exclusive access on one of the interconnect blocks. In this case the green blocks depict the interconnect blocks through which memory access is enabled. The gray block is not used for any client access. The associated activity plots corresponding to the interconnect blocks show the activity during the test case. This example enables identification of redundant switching activity in the interconnect sub-block that is supposed to be idle.
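The five knobs described in this section span a grid of candidate test cases. As a rough illustration of how such a grid could be enumerated before simulation, the sketch below builds every combination of a few values per knob. All knob names and values are hypothetical placeholders; they are not the argument names of the CMC verification framework.

```python
from itertools import product

# Hypothetical knob names and values, for illustration only.
KNOBS = {
    "num_buffers": [4, 16],
    "num_accesses": [100],
    "num_ports": [1, 2, 4],
    "sot_delay_cycles": [1, 8],              # larger delay = lower intensity
    "exclusive_interconnect": [None, "A"],   # None = no exclusivity
}


def sweep(knobs):
    """Return one dict of knob settings per combination of knob values."""
    names = list(knobs)
    return [dict(zip(names, values)) for values in product(*knobs.values())]


test_cases = sweep(KNOBS)
print(len(test_cases), "candidate test cases")
print(test_cases[0])
```

In practice the grid would be pruned to the functionally valid combinations before any simulation is launched.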

3.1.2 Activity Analysis – Test case characterization and understanding

Once the test knobs are decided and enabled in the test framework, an important step towards power analysis is to exercise them across their feasible range and study their impact on the test case. Knowledge of the effect of these knobs on the block and the test leads to decisions regarding the tests and the knob settings that will be used in the power analysis flow. This decision regarding test cases based on the knobs is an important initial checkpoint in the power analysis flow.

To understand these knobs, the Common Memory Controller (CMC) design is simulated with different values of these knobs. The selected knobs can be varied as input arguments in the simulation commands that run the test case. There are also some less obvious knobs, such as port exclusivity, which are enabled by modifying the test case in its SystemVerilog-based implementation. All the knob combinations obtained are simulated across values, and the results are captured as activity profiles. The VCD2RPT++ ActivityExplorer, an in-house tool at Ericsson, is used to get these visualizations. The test cases are simulated to generate VCD (Value Change Dump) files. The activity analysis tool takes in the VCD to analyze the switching activity across the design and helps visualize the activity of different hierarchical elements over time. The visualization also highlights which parts of the design are active during which period of the test case and is well suited to understanding the implications of the test case and the knob variations on the block design. Figure 3.3 is a simplified representation of the process of test characterization before proceeding to power analysis.
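VCD2RPT++ is an in-house tool and its internals are not described here. As a rough illustration of what activity analysis on a VCD involves, the sketch below counts value changes per signal in a tiny hand-written VCD. It is a simplified reader that handles only a basic subset of the format (scalar and vector value changes, no hierarchy handling, no time windows), and the signal names are hypothetical.

```python
from collections import Counter

SAMPLE_VCD = """\
$timescale 1ns $end
$scope module top $end
$var wire 1 ! clk $end
$var wire 1 % q $end
$upscope $end
$enddefinitions $end
#0
$dumpvars
0!
0%
$end
#5
1!
#10
0!
1%
#15
1!
#20
0!
"""


def count_toggles(vcd_text):
    """Count value changes per signal name in a simplified VCD dump."""
    names = {}            # identifier code -> signal name
    last = {}             # identifier code -> last seen value
    toggles = Counter()
    in_header = True
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if in_header:
            if tok[0] == "$var":
                # $var <type> <width> <id code> <name> ... $end
                names[tok[3]] = tok[4]
            elif tok[0] == "$enddefinitions":
                in_header = False
            continue
        if tok[0][0] in "#$":
            continue      # timestamps and $dumpvars / $end markers
        if tok[0][0] in "bBrR":
            value, code = tok[0][1:], tok[1]      # vector or real change
        else:
            value, code = tok[0][0], tok[0][1:]   # scalar change
        if code in last and last[code] != value:
            toggles[names.get(code, code)] += 1
        last[code] = value
    return toggles


toggles = count_toggles(SAMPLE_VCD)
print(dict(toggles))
```

A real activity analysis would additionally bin the toggles over time and aggregate them per hierarchical block to obtain activity profiles.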

Activity analysis is a good preliminary step for power analysis because power generally follows the trends of activity very closely. Activity analysis also takes less computational effort and less time and, in the case of VCD2RPT++, did not need an external license for internal use. This leads to quick pointers towards understanding the test case behavior with varying parameter knobs. The extended functionality achieved through VCD2TB makes it possible to replay the stimulus used for the RTL simulation to generate test benches that can be used to simulate with the netlist, if that is available. This provides a more accurate activity analysis and helps get an idea of the clock gating efficiency too. These procedures are good preliminary analyses leading up to actual power analysis, as they are relatively less time-consuming and help fine-tune the test cases that go into a more exhaustive, resource- and time-intensive power analysis using dedicated licensed tools.

Figure 3.2: Visualizing port exclusivity knob using VCD2RPT++

3.1.3 Differential Energy Analysis – Test case tuning for optimization

Once the verification framework is understood and the test knobs are framed for power analysis, the focus is on finding the right test cases for the characterization and optimization requirements of the power analysis flow. The test cases for characterization span all possible loading conditions of the block, since characterization requires understanding the behavior and performance metrics in all conditions the block operates in. The test cases for dynamic power optimization, in contrast, focus on uncovering redundant switching activity across the design. This involves identifying design areas that are active but are not functionally necessary for that test case. There are different approaches to identifying such cases, and Differential Energy Analysis is one of the techniques that help arrive at test cases indicating redundant activity and scope for optimization. It is a technique jointly developed and published by Qualcomm and Ansys. [17]

Identifying test cases well suited to uncovering power bugs is critical. This is where Differential Energy Analysis comes into play. The technique involves several steps that are not supported by the power analysis tools used in this Thesis and was therefore implemented manually, in steps, using Excel and open-source tools together with the power analysis tools, based on the underlying concept explained in the subsequent section.

3.1.3.1 Concept

Looking through simulation for all redundant toggles in a design is not the ideal way to identify inefficiency. Such a search is exhaustive and time-intensive relative to its impact on power savings. This motivates Differential Energy Analysis, which points early to the scope for dynamic power optimization at RTL. The core of the idea is to focus on the energy of jobs/operational scenarios of a design rather than power. Energy can be understood as power integrated or summed up over time. It can be formulated as below.

$$\text{Energy} = \int_{0}^{t} \text{Power}\, dt = \sum \left(\text{Piecewise power} \times \Delta\text{time}\right)$$

Figure 3.3: Simplified view of test case tuning

This leads to a rather novel technique where, instead of looking directly at redundant switching, the energy consumed by the design for a test case is compared with the energy consumption of a slowed-down version of the test case running on the design, achieved by mimicking stalls, starvations, or additional latencies in the test case. The stalls or intra-job latencies do not affect the original workload. This means that for a given workload, irrespective of the stalls, the energy consumed by the typical and the stalled test cases needs to be the same. This is expected because, as these latencies increase the duration of the test, the average power has to decrease proportionally. Tests where the energy is not similar in the two cases are a point of concern. They point to redundant switching in the design, because the same workload does not consume the same energy: the design is doing additional work beyond the intended workload due to inefficiencies.

This idea is demonstrated in Figure 3.4, where the plot on the left shows the power profile of the two variations of the job/test for the ideal example of a design optimized for dynamic power efficiency. The blue area refers to the typical job’s power versus time plot, with the area under it being the energy. The yellow area refers to the same job with intra-job latencies leading to a longer job. Since both jobs carry out the same workload, the energy, i.e. the area under the graph, is the same. However, designs in general, with different levels of inefficiency, do not always behave ideally. The plot on the right shows the realistic scenario where the energy of the job with intra-job delays exceeds the energy of a typical job execution. This points to redundant toggles occurring in the design during the idle durations or stalls. According to the plot, rather than depending only on the workload, the energy consumed becomes a function of the run-time of the test too. This is undesirable and points to inefficiencies. This comparison helps identify the test cases that are useful for power analysis of subsystems when focusing on uncovering power bugs and looking to optimize.
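The comparison can be written compactly. The notation below is an assumption introduced here for illustration: $T$ is the duration of the typical job, $T_s$ the additional time introduced by the stalls, and $P_{typ}(t)$ and $P_{stall}(t)$ the power profiles of the typical and stalled jobs.

$$E_{typ} = \int_{0}^{T} P_{typ}(t)\,dt, \qquad E_{stall} = \int_{0}^{T+T_s} P_{stall}(t)\,dt, \qquad \Delta E = E_{stall} - E_{typ}$$

In the ideal case $\Delta E = 0$. If the design instead draws some average redundant power $\bar{P}_{red}$ during the stalls, then roughly

$$\Delta E \approx \bar{P}_{red} \cdot T_s$$

As a purely hypothetical example, an excess energy of $\Delta E = 1\,\text{nJ}$ over $T_s = 2\,\mu\text{s}$ of added stall time would correspond to $\bar{P}_{red} \approx 0.5\,\text{mW}$. This simple model lumps together every contribution that grows with run-time, so the resulting figure is an upper-level indicator rather than a precise measure of redundant switching.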

Figure 3.4: Understanding Differential Energy Analysis - uncovering inefficiencies, Image from [17]

3.1.3.2 Implementation

Differential Energy analysis, despite being an interesting approach towards identifying tailored

test cases for power optimization, is not a technique that can be accomplished end-to-end using the

power analysis tools used. This necessitates the implementation of the methodology manually using

other tools. This involves a sequence of steps that should precede the exploration of power reduction.

The first step in deciding the feasibility of a test case for optimization is to simulate it and dump its VCD. Next, we need a version of the same test that performs the same workload, but with bubbles/stalls/intra-job latencies introduced in the test framework. In the case of the Common Memory Controller (CMC) and its verification framework, the intensity of read/write transactions (discussed in section 3.1.1.4) is decreased. This can also be understood as an increase in the time delay between the start of transaction (SOT) of one access and the SOT of the next. When the two test cases are ready and have been simulated to dump their respective VCDs, the next step is to obtain their power profiles (power vs. time). A Spyglass Power estimation analysis on each test case generates this power vs. time plot. Once we have the power profile plots corresponding to the two cases, the next step is to calculate the energy content of the two curves for the differential energy analysis. Since this feature is not readily available in the power analysis tools, the curves need to be converted into data points to calculate the energy content manually. The open-source tool WebPlotDigitizer Version 4.1, distributed under the GNU Affero General Public License Version 3 [18], is used to convert the plots obtained from the power analysis of the two cases into Comma Separated Values (CSV) for further computations. It takes the power profile plot images as inputs and, with some manual scoping, converts each plot into a CSV consisting of the time and corresponding power values. Once the CSV is obtained from the plot digitizer tool, the energy content of the curves is calculated in Microsoft Excel using the equation for energy defined in section 3.1.3.1. Since we are looking at a digitized plot, we use the summation-of-areas method, splitting the area under the curve into tiny rectangles to approximate the integral as closely as possible. Once we have the energy of the two plots in Excel, they are subtracted, and the difference is noted as the excess energy in the delayed test case. The higher the energy difference between the two cases, the greater the scope for identifying redundant toggling activity in the design. This whole procedure, as implemented using the tools, is represented in Figure 3.6.
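
The rectangle summation and subtraction described above can also be sketched in a few lines of Python instead of a spreadsheet. This is only an illustrative sketch: the function names, the two-column `time,power` CSV layout, and the sample values are assumptions made here, not data or code from the thesis flow.

```python
import csv
import io


def load_profile(csv_text):
    """Parse 'time,power' CSV text such as a plot digitizer might export."""
    return [(row[0], row[1]) for row in csv.reader(io.StringIO(csv_text)) if row]


def energy_from_profile(rows):
    """Rectangle-rule estimate of the energy under a power vs. time curve.

    Each rectangle spans one sampling interval and uses the power at the
    start of that interval as its height, so the last sample only marks
    the end of the final interval.
    """
    samples = sorted((float(t), float(p)) for t, p in rows)
    energy = 0.0
    for (t0, p0), (t1, _) in zip(samples, samples[1:]):
        energy += p0 * (t1 - t0)
    return energy


# Hypothetical digitized profiles (time in us, power in mW), not thesis data.
baseline_csv = "0,10\n1,12\n2,10\n3,10\n"
stalled_csv = "0,10\n1,12\n2,8\n3,8\n4,10\n5,10\n"

e_base = energy_from_profile(load_profile(baseline_csv))
e_stall = energy_from_profile(load_profile(stalled_csv))
excess = e_stall - e_base  # excess energy in the delayed (stalled) test case
```

With time in microseconds and power in milliwatts, the resulting energies are in nanojoules. A finer digitization (more points per plot) narrows the rectangles and brings the sum closer to the integral.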

To draw conclusions from this procedure, we start the differential energy analysis with the total power curves, i.e., the total power profiles with and without the bubbles are compared. If we see a significant difference in energy between the two cases, we perform a similar procedure on the internal power and switching power plots. Once we choose the candidate with a high energy difference, we go further down the tree and find the energy difference in the memory, sequential, and combinational components of both the internal and switching power. This helps us narrow down the components of the design that mainly contribute to the redundant power dissipation. This top-down traversal of the analysis tree can be visualized as in Figure 3.7. The inferences made from the internal and switching energies can be analyzed as in [17].
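
The descent through the analysis tree can be sketched as follows. The dictionary keys, the energy values, and the notion of a single numeric threshold for a "significant" difference are assumptions for illustration; the thesis does not prescribe a specific threshold.

```python
CATEGORIES = ("internal", "switching")
COMPONENTS = ("memory", "sequential", "combinational")


def localize(base, stalled, threshold):
    """Walk total -> power category -> component, descending only where
    the stalled run shows a significant excess over the baseline run."""
    def excess(key):
        return stalled[key] - base[key]

    if excess("total") < threshold:
        return []
    suspects = []
    for category in CATEGORIES:
        if excess(category) < threshold:
            continue  # this branch of the tree shows no notable change
        for component in COMPONENTS:
            key = f"{category}.{component}"
            if excess(key) >= threshold:
                suspects.append((key, excess(key)))
    # Largest excess first, so the main contributor is listed on top.
    return sorted(suspects, key=lambda item: -item[1])


# Hypothetical energies (arbitrary units) for the two runs.
base = {
    "total": 100, "internal": 60, "switching": 40,
    "internal.memory": 20, "internal.sequential": 30, "internal.combinational": 10,
    "switching.memory": 10, "switching.sequential": 20, "switching.combinational": 10,
}
stalled = {
    "total": 130, "internal": 88, "switching": 42,
    "internal.memory": 21, "internal.sequential": 56, "internal.combinational": 11,
    "switching.memory": 10, "switching.sequential": 21, "switching.combinational": 11,
}
suspects = localize(base, stalled, threshold=5)
```

In this made-up example only the internal energy of the sequential logic stands out, which would point towards clocking of registers as the place to look.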

Figure 3.5: Inferences based on the scenarios of energy difference between the two test cases [17]

Figure 3.6: Differential Energy Analysis Implementation flow

In Figure 3.5, the red arrows indicate an increase in energy in the stalled test case compared to a typical job execution. The green arrows represent no noticeable change in the calculated energy. Inferences can be made about the sources of inefficiency based on the components that show a significant energy difference, as shown in the table. A very common scenario is represented in point 2, where redundant clock toggles continue to occur even when the data on the D/Q pins does not toggle. These are pointers to implementing the widely used clock gating technique in the design, a proven technique that improves the dynamic power performance of designs.

Once the major problem areas are identified by going down this analysis tree, we can look into the specific areas for localized redundancy reduction by examining the register-level metrics computed in the power analysis in Spyglass. Once the root causes of these problems are identified, they can be fixed at RTL and re-simulated for power analysis. This enables early power bug detection and therefore improved power performance of the designs. The flow of the complete analysis can be visualized as in Figure 3.8.

The flow starts with the power test cases identified from the previous sections using the functional

verification framework. These tests are simulated with and without stalls controlled from the

verification environment to obtain switching activity details. A power analysis run over the two cases

leads to the power profiles for the different power components discussed in previous sections. These

power profiles are used to calculate energy consumption and are categorized based on the power

components for memory, sequential, and combinational circuits. This helps localize problem areas, and once these have been narrowed down, specific pointers for RTL-based optimization are taken up.

Figure 3.7: Differential Energy Analysis - Flow sequence for redundancy localization

3.2 RTL based early Power Analysis and Optimization

Once the power analysis flow is enabled to obtain stimulus from the performance verification framework, the knobs that control the test parameters are specified, created, and tested. Once the optimization test cases are filtered out and tailored for the flow (using techniques like the differential energy analysis), the flow proceeds to the power analysis. For this Thesis, Spyglass Power (from Synopsys) is used as the main analysis tool in the RTL-based early power analysis flow. This choice of early analysis is motivated by the advantages of early problem identification in the IP design flow. The power analysis flow serves a two-fold purpose for any design block under analysis: characterization and optimization of the block.

Figure 3.8: Analysis and problem localization using Differential Energy Analysis

Figure 3.9: Purpose-to-use case/operating point correlation to utilize the flow for ASIC IP blocks

The goals of the power analysis flow developed in the Thesis are connected to the use-cases of the ASIC IP block as shown in Figure 3.9. The operating points in the figure originate at the system level and are propagated down to the block and subsystem level. They are categorized based on the various utilization scenarios. The idle operating point involves low utilization, such as sleep or coming up from reset. The typical operating point refers to the typical usage scenario of the ASIC IP block. The high operating point refers to use-cases that are generally short-lived, high-utilization scenarios of the block. The thermal operating point is derived to consider overutilization when planning for thermal and cooling requirements. Special cases refer to specific use-cases of the block developed to uncover power inefficiencies and optimize the block. The operating points propagated down from system level are all used to collect characterization data for the block. This data can serve as a database that, combined with that of all the other sub-blocks in the ASIC, helps estimate the power with good correlation using a system-level power analysis. The only use-cases of interest for uncovering power bugs and for optimization are the low (idle) and typical operating points, as these are the points that can help uncover redundant dynamic activity. Special cases, such as the differential energy analysis test cases, are used only to provide pointers for optimization.
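
The purpose-to-operating-point correlation can be written down as a small lookup table. The mapping below is one reading of the paragraph above (Figure 3.9 itself is not reproduced here), and the identifier names are chosen for illustration only.

```python
# Which purposes each operating point serves, as read from the text above.
OPERATING_POINTS = {
    "idle": {"characterization", "optimization"},
    "typical": {"characterization", "optimization"},
    "high": {"characterization"},
    "thermal": {"characterization"},
    "special": {"optimization"},  # e.g. differential energy analysis tests
}


def points_for(purpose):
    """List the operating points that feed a given purpose of the flow."""
    return sorted(p for p, uses in OPERATING_POINTS.items() if purpose in uses)


optimization_points = points_for("optimization")
characterization_points = points_for("characterization")
```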

3.2.1 Characterization of the block

Power characterization is an important requirement for ASIC IP Blocks. All ASIC designs have

system-level power and thermal design requirements, corresponding to different operating points.

These system-level operating points are propagated down to the subsystem and block levels. Hence, it is important to characterize the power of a hierarchical IP block such as the Common Memory Controller (CMC). Therefore, we use the performance verification framework and the power knobs developed to incrementally load the CMC and track how the activity and power profile scale with the load. We monitor the power split-up, activity, and gating efficiency numbers under varying load, and document and plot them. The goal is to have the power and activity numbers baselined for a design. These serve as good starting points when working on optimizing the blocks or when scaling the block for future designs. This helps the SoC architects predict the power and thermal performance in advance with relative accuracy, which leads to better planning and a reduced risk of uninformed decisions. Figure 3.10 represents a sample of the system-level operating point characterization, which can be propagated down to sub-block levels.

Figure 3.10: A representation of power-based operating points of an ASIC IP Block

In the case of CMC, the block was loaded incrementally in terms of the number of ports/clients

accessing memory areas. This was done by simulating the power test cases used from the verification

framework, with different arguments provided as input for the number of port accesses knob. The

argument was varied across all logical values relating to the range of available ports on CMC. Each of

these simulations produced an FSDB file which was provided as the input for the corresponding

Spyglass Power analysis for that loading condition. The inputs and other fundamentals for the

Spyglass analysis are as discussed in section 2.2.2. An activity analysis is performed first for the

different operational points followed by Spyglass Power analyses for these test cases. The power

analyses focus on collecting, analyzing, and visualizing the power metrics for the varied loading

conditions. The metrics tabulated and plotted are the activity in the design, the power consumption and its split-up, and the clock gating metrics discussed in this Thesis. The trends in these metrics with

loading are studied and visualized. The technology-independent metrics collected in this process are

helpful for the SoC architects to foresee design requirements such as a power delivery system, a heat

exchange mechanism, and other aspects of product design.

The cost of a memory access by one client, expressed in power numbers, is another characterization metric calculated for the CMC. The power cost is then studied with respect to an increasing number of clients performing unique memory accesses. This helps establish a relationship between the load on the CMC and the power, and yields the trend of the cost per access as the number of clients increases. This is performed by isolating test cases for the cost analysis using the knobs and incrementing the clients over a few cases such as 1 client, 5 clients, and 10 clients. This variation is used to track how the cost changes as the clients increase and to look for a relationship that holds for higher client numbers. The memory access location is fixed, and by default 100 accesses are captured over which the cost is averaged. Activity analysis is then performed over the block using the VCD2RPT++ tool. The active hierarchical elements during the access are monitored and noted for all the test cases. Once the active blocks are identified using this tool, power analysis is done using Spyglass Power, and the power numbers corresponding to the individual access are gathered. This leads to an estimate of the power for individual memory accesses by clients. The overhead added to the cost by multiple clients is also captured, as far as possible, by repeating this with 5 and 10 clients accessing memory.
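
One possible way to turn the gathered power numbers into a per-client cost and a multi-client overhead is sketched below. The two definitions used here (average power divided by the number of clients, and the power beyond N times the single-client cost) are assumptions made for illustration, as are the sample values; the thesis does not state its exact formulas in this section.

```python
def per_client_cost(power_by_clients):
    """Average power attributed to each client for every measured load."""
    return {n: p / n for n, p in power_by_clients.items()}


def multi_client_overhead(power_by_clients):
    """Power beyond what N independent single-client accesses would cost."""
    single = power_by_clients[1]
    return {n: p - n * single for n, p in power_by_clients.items()}


# Hypothetical access power in mW, each averaged over 100 accesses.
measured = {1: 2.0, 5: 11.0, 10: 24.0}
cost = per_client_cost(measured)
overhead = multi_client_overhead(measured)
```

A per-client cost that grows with the number of clients, as in these invented numbers, would indicate an overhead that has to be accounted for when extrapolating to higher client counts.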

A study of the CMC with default values for all the power test knobs except the number of port accesses, which is varied incrementally, lays the basis for the characterization of the block. The number of client accesses is one of the major knobs controlling the loading of the block, although a similar analysis can also be done with other knobs, such as intensity, when characterizing the block. From the Spyglass analysis, all available power numbers and power efficiency metrics are captured, tabulated, and visualized. The activity parameters captured are the average activity, the average register activity, the average register D pin activity, and the average combinational net activity. The power-related metrics captured are the average total power, internal power, switching power, total dynamic power, and static leakage power. The power performance metrics captured are the clock gating efficiency, the average Register Output Activity Density for Flop (ROADF), and the average Register Output Activity Density for Enables (ROADE). This forms a baseline analysis that characterizes the CMC.

The characterization of the CMC yields data points for technology-independent metrics, such as activity and clock gating efficiency, and for technology-dependent numbers, such as the power split-up. When characterized and well documented for a design, the technology-independent metrics form good supporting data for Electronic System-Level (ESL) power analysis, which is a more holistic IC-level analysis. ESL power analysis combines physical layout data and the fabrication vendor's inputs with the technology-independent metrics characterized by this flow, and extrapolates from them to predict the effect of modified future design decisions. It provides the much-needed starting point for predicting, with sufficient accuracy, the effect of modifications on baselined blocks in use.

A series of relationships are derived from the metrics and power numbers obtained as data points and presented as graphs. These details are shared with the SoC architects and other stakeholders of the block's future designs to help them extrapolate power estimates and foresee future requirements such as power management systems and heat management systems.

This form of characterization at the block level helps form a database of power behavior of each

block under different loads. The database is formed out of the metrics such as register and

combinational switching activity and the memory utilization. When such ‘power behavior databases’

are accumulated for all the blocks, system-level analysis of power is enabled and forms the basis of

analyzing the system power and foreseeing improvements as a system.

3.2.2 Analysis and Optimization flow

Apart from characterizing the design block, one of the major motivations for early power analysis is

the optimization of the block. The Thesis focuses on dynamic power optimization by eliminating

redundant switching activities in the design. This is achieved through a sequence of steps starting

with power analysis using the Spyglass Power tool. Introducing early identification of power bugs and optimization into the IP design flow improves the whole flow and reduces the likelihood of finding costly bugs later in the design flow. The goal of this power analysis flow implementation is to develop a flow that can be incorporated into the design cycle, similar to code coverage analysis, where realistic goals are set for designs and iterative analyses are performed to reach them.

An analysis and optimization flow is feasible through the use of metrics that quantify the current

performance. In the case of power optimization, the metrics associated with power need to be

understood and quantified clearly before they can be optimized. The choice of metrics and their definitions are explained in section 2.3.2. The goal is to set metric goals as part of the power analysis flow, against

which the design needs to be optimized. These goals need to ensure good dynamic power performance

of the design and yet be realistic for the design. A suggested approach for a new implementation of

the power analysis goal is to start with qualitative soft goals and fine-tune these goals with time and

incremental knowledge of the block’s power performance in the use-cases they are exposed to. The

dynamic power performance metrics mainly used in the flow are the dynamic and static clock gating

efficiency (DCGE and SCGE) and Register Output Activity Density for Enables (ROADE).

So, this flow starts with a power analysis on the design using Spyglass Power. The power analysis

is performed in a sequence of steps as described in section 2.2.2. This provides a hierarchical split-up

of power numbers and the metrics discussed. Once the analysis is set up and all the power and metric

results of the analysis are available for an optimization test case (generated from the performance

framework), we perform the analysis following the flow described in Figure 3.11. This is performed on the test cases that were developed specifically for optimization using the techniques described earlier, such as differential energy analysis, that tune test cases for optimization. The differential energy analysis used to arrive at such test cases also provides pointers to the specific areas of the design that are the main contributors to dynamic power inefficiency.

The flow in Figure 3.11 describes the iterative approach to power optimization. It is a metric-based optimization flow, where the metrics are the outcome of a Spyglass Power analysis run on the design. A good dynamic clock gating efficiency (DCGE) rules out the need for any further optimization; hence, that is the initial hurdle the design has to pass. For the Common Memory Controller (CMC), a soft goal of 80% is set for DCGE for this flow development. Designers start with reasonable estimates for the metric goals and fine-tune them based on the knowledge of the design's power performance gained over time. If the DCGE is insufficient, the flow proceeds to analyze the hierarchical level further. There could be two reasons for a low DCGE. The first is a low percentage of instantiated clock gates in the design, which is quantified by the static clock gating efficiency (SCGE). A low SCGE means insufficient clock gates in the design, and is mitigated by instantiating more clock gates. This can be done by going down the hierarchy to the register level, identifying the register banks without provision for clock gating, and then introducing gating conditions in the corresponding RTL. We set a soft goal of 90% on SCGE. This should necessarily improve the SCGE of the design in the next iteration of power analysis, but that might not be the case with DCGE, so the DCGE is rechecked in the subsequent iteration. If the SCGE improves and the DCGE does not, the flow proceeds to check the next metric, Register Output Activity Density for Enables (ROADE), which reflects the quality of the clock gating enables. This is the second reason for a low DCGE: if the ROADE is not high enough, the gating conditions are poor, and this has led to inefficient clock gating in the design. This motivates going down the hierarchies where ROADE is low, identifying register banks with poor enable conditions, and logically improving them, which should improve the ROADE. A soft goal of 95% is set on the ROADE of the design. A hierarchy that meets the ROADE goal is excused from failing the DCGE goal: the DCGE reaches its peak value for a ROADE greater than 95% and cannot go any higher for that design, because of the number of data transitions that need enabled clock edges to be captured. A failure in the ROADE goal should lead to modification of the clock gating enable conditions for registers with bad ROADE. Improvement in ROADE will necessarily lead to increased DCGE. Thus, an RTL change is made for this modification and another power analysis iteration is performed. Once all the metrics in the sequence are passed, the flow proceeds to the next hierarchical entity. The flow continues across all hierarchical elements until all elements are passed or waived for special reasons. Such an analysis can be compared to code coverage analysis.
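
The decision sequence described above can be summarized as a small function applied to each hierarchical element. This is a sketch of the textual description only, not a transcription of Figure 3.11; the function name, the return strings, and the example metric values are hypothetical, while the three soft goals are the ones quoted in the text.

```python
# Soft goals quoted in the text, in percent.
DCGE_GOAL, SCGE_GOAL, ROADE_GOAL = 80.0, 90.0, 95.0


def next_action(dcge, scge, roade):
    """Return the next step for one hierarchical element."""
    if dcge >= DCGE_GOAL:
        return "pass"
    if scge < SCGE_GOAL:
        return "instantiate more clock gates"
    if roade < ROADE_GOAL:
        return "improve enable conditions"
    # SCGE and ROADE goals are met, so DCGE is taken to be at its peak.
    return "pass (DCGE at peak for this design)"


# Hypothetical per-hierarchy metrics: (DCGE, SCGE, ROADE) in percent.
hierarchy = {
    "block_a": (85.0, 70.0, 50.0),
    "block_b": (60.0, 70.0, 99.0),
    "block_c": (60.0, 92.0, 80.0),
    "block_d": (60.0, 92.0, 97.0),
}
report = {name: next_action(*metrics) for name, metrics in hierarchy.items()}
```

After each RTL change, the power analysis is rerun and the function is applied again with the new metrics, until every element either passes or is waived.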

This flow was developed and applied to the CMC, and as a result, a document was created that listed register-level pointers for improving the metrics, along with the scope for improvement. This was shared with the design team at Ericsson and discussed as a set of changes that could be taken up to improve the design. An optimization effort based on the analysis of one of the sub-blocks was also made through RTL modification.

Figure 3.11: Dynamic power optimization implementation methodology

3.3 Netlist based power analysis

Early power analysis is the first focus of the Thesis because the earlier amendments are made to a design from a power perspective, the higher the impact on power savings. It should be noted, however, that the accuracy of power estimation increases the further along the design flow the ASIC IP block is. Thus, the power analysis flow developed in the Thesis needs to incorporate pre-layout netlist-based activity and power analysis, which this section deals with. Power analysis closer to the signoff power estimation stages yields relatively more accurate results due to the higher accuracy of switching activity annotation.

Functional verification environments such as UVM and TLM are built around RTL model simulations. This makes it challenging to plug netlist DUTs into such an environment. So, when accuracy motivates gate-level power analysis, gate-level simulation becomes necessary. Developing a verification framework around netlist DUTs is a time- and resource-intensive task that can inhibit gate-level power analysis [21]. This leads us to the first steps of gate-level power analysis using the in-house tool at Ericsson. This tool enables gate-level simulation by leveraging the RTL verification environment with a netlist DUT. The subsequent step uses PrimeTimePX from Synopsys to perform netlist-level power estimation for accurate signoff power estimation. This can be used as the characterization tool after the iterative power analysis flow finalizes the design. In this Thesis, this tool is used to validate the power savings for a sub-block of the Common Memory Controller (CMC) after the block was modified to implement clock gating, once scope for reducing dynamic power inefficiency had been identified. The subsequent section proceeds to the first steps of netlist-based simulation and power analysis.

3.3.1 VCD2TB – Netlist simulation using Dump, Convert, Replay

VCD2TB is an in-house tool designed at Ericsson that facilitates gate-level simulation for power analysis by reusing the RTL verification stimuli [21]. It enables quicker gate-level simulation by generating stimulus from the RTL simulation environment that can be replayed on the netlist. The sequence of steps is depicted in Figure 3.12.

The procedure towards gate-level power analysis starts with an RTL simulation. The RTL design is simulated together with the UVM-based verification environment. As a result of the simulation, a VCD of the input stimuli is generated. The VCD2TB tool uses this input-stimuli VCD to create a testbench that, once compiled, can be used to simulate a netlist. The gate-level simulation in turn generates a gate-level VCD file for activity and power analysis. This VCD can be used to study the activity and clock gating efficiency using the VCD2RPT++ tool. The gate-level VCD can also be used, along with the netlist design, as input to a power analysis tool such as Spyglass Power to estimate power and related metrics with higher accuracy. This technique saves significant time and resources by eliminating the need to build a verification framework for netlist-based simulation. It follows the sequence of dumping a VCD from the RTL simulation, converting the input stimuli into a testbench, and replaying that testbench on the netlist to generate gate-level VCDs, which can be used for a more accurate analysis of power.
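
To make the "convert" step of dump, convert, replay more concrete, the sketch below turns the value changes of selected input signals in a VCD into the body of a Verilog `initial` block. It is not the VCD2TB implementation, whose internals are not described here; it is a minimal, hypothetical illustration that handles only a small subset of the VCD format, and the signal names in the sample are invented.

```python
def vcd_to_stimulus(vcd_text, inputs):
    """Turn value changes on the named input signals into the body of a
    Verilog 'initial' block that replays them with the original timing."""
    signals = {}   # VCD identifier code -> (signal name, bit width)
    changes = []   # (time, signal name, Verilog literal)
    now = 0
    for raw in vcd_text.splitlines():
        tok = raw.split()
        if not tok:
            continue
        head = tok[0]
        if head == "$var":
            # $var <type> <width> <id code> <name> ... $end
            if tok[4] in inputs:
                signals[tok[3]] = (tok[4], int(tok[2]))
        elif head.startswith("#"):
            now = int(head[1:])
        elif head[0] in "01xzXZ" and head[1:] in signals:
            # Scalar change such as "1!"
            changes.append((now, signals[head[1:]][0], "1'b" + head[0]))
        elif head[0] in "bB" and len(tok) == 2 and tok[1] in signals:
            # Vector change such as 'b101 "'
            name, width = signals[tok[1]]
            changes.append((now, name, f"{width}'b{head[1:]}"))
    lines, last = ["initial begin"], 0
    for time, name, literal in changes:
        delay = f"#{time - last} " if time > last else ""
        lines.append(f"  {delay}{name} = {literal};")
        last = time
    lines.append("end")
    return "\n".join(lines)


# Invented RTL dump: clk and addr drive the DUT, ready is a DUT output.
sample_vcd = '''$timescale 1ns $end
$scope module tb $end
$var wire 1 ! clk $end
$var wire 4 " addr $end
$var wire 1 # ready $end
$upscope $end
$enddefinitions $end
#0
0!
b0000 "
0#
#5
1!
b101 "
#10
0!
1#
'''
stimulus = vcd_to_stimulus(sample_vcd, inputs={"clk", "addr"})
```

Only the DUT inputs are replayed; outputs such as `ready` are left for the netlist to produce, which is what allows the same stimulus to drive the gate-level design.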

This approach leads to quick netlist-based gate-level simulation and VCD dumping, which enables power analysis. Such a tool is useful in an iterative analysis flow such as the one developed in this Thesis, as it avoids the need for a dedicated gate-level simulation setup. It also helps obtain and visualize activity profiles extremely fast in comparison to using a netlist-based power analysis tool such as PrimeTimePX, which is preferable for the final validation steps and for a specific narrow time window in a test case, where the narrow simulation is not as resource-intensive.

3.3.2 PrimeTimePX – Gate level sign-off power estimation

PrimeTimePX, introduced in section 2.2.3, is a power analysis tool from Synopsys that is used in the Thesis for a more accurate gate-level sign-off power estimation. This forms one of the later steps in the analysis flow. One of the major scenarios where the Thesis flow utilizes PrimeTimePX is to validate an improvement scenario identified using the methodologies discussed.

PrimeTimePX is utilized in the power analysis flow when narrowing down on problem areas and validating improvements made through modification, during the later stages of the analysis. One such problematic area was identified in the Common Memory Controller (CMC). Problem areas were identified early in the flow and validated for modification in the later stages of the analysis flow. Once the problem areas are identified, the subsequent steps are to identify clock gating conditions and implement them as RTL changes, as discussed in section 2.3.1. After modification of the RTL, a netlist-level power estimation is performed to validate the optimization before finalizing the changes. This involves synthesizing the modified RTL to generate a new pre-layout netlist and then using it for the subsequent PrimeTimePX power analysis. A comparative analysis is performed between the two versions of the design, one with manually instantiated clock gates in RTL and the other without. The results of the two are compared and analyzed. This is used as a sign-off validation step in the analysis flow developed in the Thesis. This example analysis performed on the CMC is discussed in the subsequent section.

Figure 3.12: Using VCD2TB for gate-level simulation

3.4 Cache block analysis – A sample analysis & optimization

The cache block in the Common Memory Controller (CMC) is used as an example of analysis and optimization using the flow developed in this Thesis. This section deals with analyzing the cache block for issues using the power metrics visualization, hypothesizing the cause of the problem from the analysis, identifying the corresponding root cause in RTL, modifying the RTL for improved performance, and finally validating the modification using the sign-off power analysis tool, PrimeTimePX.

3.4.1 Problem

The power analysis flow developed in the Thesis is designed to begin at a higher level in the hierarchy and dig down to specific problem areas that can be improved. The CMC contains a cache subsystem whose early analysis points to issues related to poor clock gating. The first step of the analysis was to simulate the block in a test case in which low activity is expected on the cache block. Then, the input VCDs generated from the simulation are used as inputs to the VCD2TB tool as discussed in section 3.3.1. This generates the testbench for the gate-level simulation. The gate-level VCD generated from that simulation is used to analyze activity and clock gating efficiency using the in-house tool VCD2RPT++. This helps visualize the activity and the clock gating efficiency of each of the hierarchical elements in the CMC across the duration of the test case.

The cache block consists of the associated cache memory, a controller, and its interface block. It handles cache access requests from clients and provides access to the content of the memory on a ‘cache hit’; otherwise, it signals a ‘cache miss’ and proceeds to move the data into a free cache slot. Figure 3.13 shows screen captures from the VCD2RPT++ tool’s user interface, focusing on the cache sub-block, at a time instant in a test case where low or almost no activity is expected on the block. It shows the sub-blocks within the cache block in the CMC, visualized as rectangles, where each rectangle represents a hierarchical element inside the cache. The activity view on the left shows the activity across the sub-blocks, and the CGE view on the right shows the clock gating efficiency across the sub-blocks; both are color-scaled according to the scale at the top of the screenshot. The sub-block highlighted with the dark-blue outline is an acknowledgment (ACK) first-in-first-out queue (FIFO) in the cache that handles acknowledgment requests for cache accesses. There are multiple instances of this ACK FIFO, and the analysis narrows down to this level for analysis and optimization.

Figure 3.13: Methodology of identifying power inefficiency in CMC - VCD2RPT++ screenshot

Figure 3.13 shows that, at a time instant of the test case, the ACK FIFO highlighted in the two views has low activity (from the color code), which is expected from a test case that does not create high activity in the cache block. The CGE view shows that the clock gating efficiency is also low (from the color code) for the same FIFO sub-block. Low activity combined with low clock gating efficiency points to redundant clock switching, which is unnecessary for the functionality of the design: the clock keeps toggling when neither data nor functionality needs the hierarchical element to be active. This analysis leads to the hypothesis that some form of redundancy causes poor clock gating efficiency in low-activity scenarios. This can be seen across all six instances of the ACK FIFOs in Figure 3.13. It indicates scope for improving the clock gating of the FIFO blocks by modifying the RTL. This visualization-based technique is one of the techniques the flow uses to pinpoint coarse-grained redundancies early in the design.
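
The visual check (low activity together with low clock gating efficiency) can also be expressed as a simple filter over per-block numbers. The block names, values, and thresholds below are hypothetical and do not come from VCD2RPT++ or from the CMC analysis.

```python
def flag_redundant_clocking(blocks, activity_max, cge_min):
    """Return names of blocks that are nearly idle yet poorly clock gated."""
    return sorted(
        name
        for name, (activity, cge) in blocks.items()
        if activity <= activity_max and cge < cge_min
    )


# Hypothetical per-block (average activity, clock gating efficiency), both 0..1.
snapshot = {
    "ack_fifo_0": (0.01, 0.10),
    "ack_fifo_1": (0.02, 0.12),
    "tag_ctrl": (0.02, 0.95),
    "data_path": (0.40, 0.30),
}
suspects = flag_redundant_clocking(snapshot, activity_max=0.05, cge_min=0.5)
```

A block such as `data_path` in this example is not flagged: its clock gating efficiency is low, but so much data is moving that the clock may genuinely be needed.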

3.4.2 Solution and validation

As the analysis above identified the possibility to improve clock gating conditions in the RTL of the

ACK FIFOs, the subsequent step is to study all the possibilities for clock gating enable conditions in

the RTL corresponding to these FIFOs. The goal of this step is to identify redundant read/write operations, where the logic does not gate or disable the clock even though there is no update in the data. Such a situation is identified for the read operation in the RTL, code snippets of which are shown in Figure 3.14 and Figure 3.15.

By focusing on the problematic areas flagged by the metrics evaluated in the power analysis flow, we identify the modules whose RTL can be modified to improve power by introducing clock gating. This is the case with the ACK FIFOs: once the analysis discussed above pointed to poor clock gating efficiency, the RTL was examined for better gating conditions for the registers in the FIFO. Comparing the code snippets in Figure 3.14 and Figure 3.15 shows that, in the original design without clock gating, the read operation is redundant whenever there is no data update. This motivates using a data-update flag as the enable for clock gating, as highlighted by the green box in the modified snippet with CG insertion. The if condition wrapping the complete array read operation disables redundant reads when there is no new data. This modification aims to reduce the switching activity of the ACK FIFO in the Cache interface block.
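The effect of wrapping the read in a data-update check can be illustrated with a small behavioral model. The sketch below is not the RTL of Figure 3.14 or Figure 3.15; the class `ReadRegisterBank`, the helper `clocked_fraction`, and the 10-cycle trace are hypothetical and only mimic the idea that registers are clocked every cycle in the original design but only on a data update in the modified one.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class ReadRegisterBank:
    """Behavioral stand-in for the registers that capture FIFO read data."""

    width: int
    gated: bool  # True models the RTL whose read is wrapped in the update check
    value: List[int] = field(init=False)
    clocked_cycles: int = 0  # cycles in which the registers receive a clock edge

    def __post_init__(self) -> None:
        self.value = [0] * self.width

    def tick(self, new_data: Sequence[int], data_updated: bool) -> None:
        """Advance the model by one clock cycle."""
        if self.gated and not data_updated:
            return  # enable is low: the clock is gated and the registers hold
        self.clocked_cycles += 1
        if data_updated:
            self.value = list(new_data)
        # Ungated and no update: the registers are clocked but keep the same
        # value, which is the redundant switching the analysis flagged.


def clocked_fraction(bank: ReadRegisterBank,
                     trace: Sequence[Tuple[Sequence[int], bool]]) -> float:
    """Run a (data, data_updated) trace and return the share of clocked cycles."""
    for data, updated in trace:
        bank.tick(data, updated)
    return bank.clocked_cycles / len(trace)


# Hypothetical 10-cycle trace with new data only in cycles 0 and 5.
TRACE = [([c, c + 1], c in (0, 5)) for c in range(10)]

original = ReadRegisterBank(width=2, gated=False)
modified = ReadRegisterBank(width=2, gated=True)
original_fraction = clocked_fraction(original, TRACE)
modified_fraction = clocked_fraction(modified, TRACE)
```

In this toy trace both banks end with the same stored value, while the gated bank is clocked in far fewer cycles; the model says nothing about how a synthesis tool implements the gating.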

The cache interface block contains multiple such ACK FIFOs. A similar modification to all these FIFO instances translates to an activity and power improvement that accumulates over the number of instances. The Common Memory Controller (CMC) consists of multiple interconnect blocks, each with its own cache block, so identifying a single such opportunity multiplies into a power improvement across the chip. A PrimeTimePX analysis is performed to validate the significance of this modification for power and related metrics. These results are analyzed and discussed in section 4.2.


Figure 3.14: FIFO RTL without Clock gating for comparison


Figure 3.15: FIFO RTL with Clock Gating


The procedure for analyzing the cache block in CMC using PrimeTimePX (PtPX in the flow diagram) is visualized in Figure 3.16. The results are collected from the PrimeTimePX reports, and the effect of the RTL changes on the netlist is also considered in the decision-making process. Validation with a sign-off power analysis tool such as PrimeTimePX is the critical, penultimate step of the power analysis flow; the final step is the decision on whether the RTL change iteration is valid.

Figure 3.16: Validation and signoff analysis of RTL improvements


Figure 3.17: Power analysis and Reduction Flow Summary


3.5 Summary – An end-to-end power analysis and reduction flow

The procedures discussed so far in this chapter form the steps of a comprehensive power analysis and optimization flow developed in this Thesis. The flow serves two purposes: characterization and optimization of the ASIC IP Block. It begins by identifying and creating test cases for both purposes through a connection to the existing performance analysis framework, and then proceeds to the different approaches to analysis, characterization, and optimization.

Figure 3.17 summarizes the tools and the methodology flow used in this Thesis. The representation shows the two front-end design stages, the RTL and the netlist, and the succeeding steps in each case for implementing the power analysis and optimization flow. It covers the design tools, the simulation and verification tools, and the analysis tools involved. An ideal early power analysis starts at the top-left corner of the flow and finishes at the bottom-right corner. The tools represented in the flowchart and explained in the previous sections offer different dimensions and opportunities for analyzing the block for power and improving it. Each tool provides its own perspective on power, while the complete sequence in the chart provides a more elaborate power perspective of the designs under analysis. As shown in the figure, the flow addresses the major goal of connecting power analysis to the block’s verification framework, which generates and drives the stimuli. Since power analysis is primarily stimuli-based, this connection is a critical step of the implementation. It is the basis for creating the power test cases on which the successive steps of the flow depend.

The RTL-stage analysis shown in the flow brings the advantages of early power savings. Its earliness has two clear benefits. First, it has larger implications for power savings, as discussed in the previous sections. Second, it allows faster, iterative analyses, so the flow sits well alongside the other iterative verification steps in the IP design flow. The flow represented in Figure 3.17 is validated using the power tools on the Common Memory Controller (CMC) block at Ericsson. The goal of this thesis is to present the flow as a methodical approach that can be implemented for any ASIC IP Block using power analysis tools with similar functions. The flow diagram is therefore representative of a more general flow that is independent of the ASIC IP Block under analysis and of the tools used.

To develop power-efficient ASIC designs, a flow such as the one discussed here needs to be applied to all hierarchical elements at their respective levels. The results of the power analysis on CMC using the tools in the flow are collected, visualized, and analyzed for the two purposes of the flow, namely characterization and optimization of CMC. The results include the data necessary to characterize a block design release as part of the design flow. They also include data from the optimization and validation steps, where an optimization at RTL is checked for power savings using the netlist-based signoff power analysis tool PrimeTimePX. The metrics collected for the validation are analyzed, and a decision is taken on the validity of the RTL modification for power. These results, consolidated from different parts of the flow, are discussed in the next chapter.


4 Results and analysis

The main goal of this Thesis has been to arrive at a power analysis and reduction flow for the Common Memory Controller (CMC) block at Ericsson using the power tools and the existing verification framework, and then to validate the flow against its characterization and optimization goals. The flow development and implementation discussed in the previous chapter is the first result of the Thesis. The subsequent results collected for characterization and optimization of CMC are an outcome of putting this flow to use and a means of validating it. This report presents the results by focusing on the outcome and applicability of the power analysis flow, without going into detailed information about the design blocks at Ericsson. The results are therefore presented using relative metrics rather than the absolute performance metric values of CMC.

4.1 Characterization results for Common Memory Controller (CMC)

The characterization of the CMC is performed using the power analysis flow developed in this Thesis. This goal of the flow focusses on profiling the blocks on the basis of the power metrics. The characterization results presented in this section follow the methodology discussed in section 3.2.1. The CMC block is simulated over a range of numbers of clients accessing memory, and power metrics are collected. The simulations vary the knobs of the test case, which are discussed in section 3.1.1.

The characterization data correlates the number of clients, that is, the load scenario (one of the power knobs in the performance analysis framework), with the other power metrics discussed in earlier sections. The data are collected from power analysis of different test scenarios, which simulate different operational scenarios, and are visualized for comprehension. The first block-level visualization focusses on the variation of the average activity of the design with the number of clients per interconnect block. The activity is split into average combinational activity, average register activity, and average activity.

The CMC block is incrementally loaded to simulate the different operating points discussed in section 3.2.1, and technology-independent metrics are collected across these operating points. The system-level operating points are set using the knobs (discussed in section 3.1.1) in the performance analysis framework of CMC. This contributes to a ‘power profile database’ that can be built from such metrics for all sub-blocks of the ASIC and thereby provide a complete picture of the power performance of the ASIC. Table 1 presents the data that classifies the different operating points for CMC and the corresponding metrics. The knob states given with Table 1 capture the qualitative states of the test knobs to which the measurements correspond. Data such as that for CMC in Table 1 consist of technology-independent metrics that, once collected, are easy to port across different projects and products that use the block, or an extension or reduction of it.


Knob states: Stochastic access with average access size = Medium size (~100B); Access intensity = Medium intensity (~70 clock cycles) between Start-of-transactions; Unrestricted Interconnect block access (uniform access); No. of accesses per port = Medium (~100)

S.no | Operating Profile | Total Ports | Avg. Reg. Activity (%) | Avg. Activity (%) | Avg. Combinational Activity (%) | Avg. DCGE (%)
1 | Low | 5 | 0,09463 | 0,2649 | 0,6759 | 73,863
2 | Low | 10 | 0,1266 | 0,3026 | 0,7634 | 73,758
3 | Typical | 25 | 0,2229 | 0,43 | 1,1 | 73,448
4 | Typical | 50 | 0,3893 | 0,6347 | 1,5 | 72,926
5 | High | 100 | 0,7276 | 1 | 2,4 | 71,913
6 | High | 150 | 1,1 | 1,4 | 3,3 | 70,92
7 | Thermal | 200 | 1,4 | 1,7 | 4 | 69,972
8 | Thermal | 250 | 1,7 | 2,1 | 4,8 | 69,019

Table 1: Operating point vs profiling metrics data for CMC
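The rows of Table 1 can be loaded into a short script to check how each metric moves with load. The sketch below only transcribes the table values (with decimal points instead of decimal commas); the helper names `column`, `strictly_increasing`, `strictly_decreasing`, and `dcge_drop_per_port` are illustrative and not part of the flow.

```python
from typing import List

# Rows of Table 1: (total ports, avg. register activity %, avg. activity %,
# avg. combinational activity %, avg. DCGE %).
TABLE_1 = [
    (5, 0.09463, 0.2649, 0.6759, 73.863),
    (10, 0.1266, 0.3026, 0.7634, 73.758),
    (25, 0.2229, 0.43, 1.1, 73.448),
    (50, 0.3893, 0.6347, 1.5, 72.926),
    (100, 0.7276, 1.0, 2.4, 71.913),
    (150, 1.1, 1.4, 3.3, 70.92),
    (200, 1.4, 1.7, 4.0, 69.972),
    (250, 1.7, 2.1, 4.8, 69.019),
]

COLUMNS = ("ports", "reg_activity", "avg_activity", "comb_activity", "dcge")


def column(name: str) -> List[float]:
    """Return one column of Table 1 by its short name."""
    index = COLUMNS.index(name)
    return [row[index] for row in TABLE_1]


def strictly_increasing(values: List[float]) -> bool:
    return all(a < b for a, b in zip(values, values[1:]))


def strictly_decreasing(values: List[float]) -> bool:
    return all(a > b for a, b in zip(values, values[1:]))


def dcge_drop_per_port() -> float:
    """Average DCGE loss, in percentage points per port, over the whole range."""
    ports, dcge = column("ports"), column("dcge")
    return (dcge[0] - dcge[-1]) / (ports[-1] - ports[0])


activity_rises = all(
    strictly_increasing(column(name))
    for name in ("reg_activity", "avg_activity", "comb_activity")
)
dcge_falls = strictly_decreasing(column("dcge"))
```

Both checks hold for the tabulated data: all three activity metrics rise and the average DCGE falls as the number of ports increases.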

Figure 4.1: Correlation between activity and loading condition of the block
[Plot: Average Activity (Percentage) vs Number of clients access. X-axis: No. of ports access per Interconnect block (1 to 50); Y-axis: Avg. Activity (Percentage) (0 to 6); series: Combinational Activity, Average Activity, Average Register Activity.]


Figure 4.1 shows the plot of all the activity metrics. The number of port/client accesses driven from the verification framework forms the x-axis. The plot shows activity increasing with the number of port accesses. This trend is expected, as a higher number of client/port accesses means higher utilization and therefore higher activity. Activity is defined as the average percentage of nets transitioning at a given clock edge, counted over combinational nets (combinational activity), registers’ clock and/or data nets (register activity), or all nets (average activity). By this definition, the results indicate that a higher average percentage of combinational nets is active during clock transitions than of the other nets in the CMC design. The plot shows the variation of combinational, average register, and total average activity as a function of the number of port accesses, and the increasing trend is common to all three components (combinational nets, registers, and overall).
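The activity definition above can be made concrete with a few lines of code. This is a minimal sketch, assuming the toggling nets are available as one set per clock edge; the net names and the four-edge trace are made up for illustration and do not come from the CMC design or from any tool output.

```python
from typing import List, Set


def average_activity(toggles_per_edge: List[Set[str]], nets: Set[str]) -> float:
    """Average percentage of `nets` that transition per clock edge."""
    if not toggles_per_edge or not nets:
        return 0.0
    toggled = sum(len(edge & nets) for edge in toggles_per_edge)
    return 100.0 * toggled / (len(nets) * len(toggles_per_edge))


# Hypothetical design with four combinational nets and four register nets.
COMB_NETS = {"c0", "c1", "c2", "c3"}
REG_NETS = {"r0_clk", "r0_d", "r1_clk", "r1_d"}
ALL_NETS = COMB_NETS | REG_NETS

# Hypothetical trace: the set of nets that toggle at each of four clock edges.
TOGGLES = [
    {"c0", "c1", "r0_clk"},
    {"c0"},
    {"c1", "c2", "r0_clk", "r0_d"},
    set(),
]

comb_activity = average_activity(TOGGLES, COMB_NETS)
reg_activity = average_activity(TOGGLES, REG_NETS)
avg_activity = average_activity(TOGGLES, ALL_NETS)
```

With equal numbers of combinational and register nets, as in this toy trace, the overall figure lies midway between the two; in a real design it is weighted by the net counts of each category.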

The metric that follows activity analysis is the clock gating efficiency of the design. The variation of clock gating efficiency with the number of ports/clients is plotted in Figure 4.2. The plot shows the average dynamic clock gating efficiency (DCGE) of CMC decreasing as the number of port/client accesses increases. The values of the performance framework knobs are as specified in Table 1. The knobs are chosen to qualitatively capture the typical operating scenario of the ASIC.

The decreasing trend of dynamic clock gating efficiency with an increasing number of clients is anticipated, as higher activity leaves fewer clock transitions that can be disabled. This follows from the correlation that higher load leads to higher activity, and therefore fewer registers can be gated while still meeting the functionality at that activity level. The interesting point is that the average dynamic clock gating efficiency at lower port utilization is not as high as would be expected if the relationship between the loading of the block and clock gating efficiency were linear. For an ideal design, the curve can be extrapolated backward linearly. A larger deviation of the DCGE curve from this linear extrapolation in low-activity scenarios therefore points to larger inefficiencies in the design. These inefficiencies show up in low-activity scenarios, where switching is not expected but redundant switching continues to exist, and clock gating is not as efficient as it could be. Because activity and clock gating efficiency are technology-independent metrics, their variation with the number of clients helps extrapolate design performance once technology details are included, using an Excel-database-based Electronic System-Level (ESL) power analysis. ESL power analysis is based on a database of such metrics collected across the blocks in an ASIC. It forms a basis for a more realistic ASIC-level power analysis that is relevant to specific projects and can be scaled to design needs and technology-specific details from vendors.

Figure 4.2: Variation of the average clock gating efficiency with increased loading of the block
[Plot: Avg. Dynamic CGE. X-axis: Number of ports (5 to 250); Y-axis: DCGE % (66 to 75); series: Avg. DCGE.]

Once technology-independent metrics such as the activity of the design and clock gating efficiency are visualized and baselined, an important characterization requirement is the block-level power consumption. Power characterization focusses on the variation of switching power, internal power, total dynamic power, leakage power, and total power with incremental loading, in the typical working scenarios of the block. Figure 4.3 plots the variation of average total power, average switching power, average internal power, and total dynamic power. These power metrics are represented as a factor of a reference power value (the dynamic power for 5-port access, according to the axis label of Figure 4.3). This provides a relationship between the values for understanding the trend without getting into the numerical details of the power consumption of the block. Such a relationship helps SoC architects predict the performance of the block under the various loading scenarios, and thereby supports thermal planning and other aspects of ASIC IP design and scaling for future products.
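Expressing every power component as a factor of one reference value is a simple normalization. The sketch below uses made-up milliwatt values for three operating points; only the relations dynamic = switching + internal and total = dynamic + leakage are standard, and the function name `normalize` is an assumption for illustration.

```python
from typing import Dict, List


def normalize(series: Dict[str, List[float]], reference: float) -> Dict[str, List[float]]:
    """Express every value in every series as a factor of `reference`."""
    if reference <= 0:
        raise ValueError("reference power must be positive")
    return {name: [v / reference for v in values] for name, values in series.items()}


# Hypothetical absolute powers in mW at 5, 10, and 25 ports (not CMC data).
switching = [2.0, 2.5, 4.0]
internal = [6.0, 7.0, 9.0]
leakage = [1.0, 1.0, 1.0]

dynamic = [s + i for s, i in zip(switching, internal)]
total = [d + l for d, l in zip(dynamic, leakage)]

# Reference: total dynamic power at the lowest-load point (5 ports).
relative = normalize(
    {"switching": switching, "internal": internal, "dynamic": dynamic,
     "leakage": leakage, "total": total},
    reference=dynamic[0],
)
```

The normalized series keep the shape of each curve and the ratios between components, which is what allows trends to be reported without disclosing absolute power numbers.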

The above characterization plots provide an overview of the power, activity, and clock gating efficiency numbers of the block that contribute to the power performance of the design. Any design needs to have such an analysis performed and the data baselined and stored in a database to be used together with the details of other blocks. Such a database enables Electronic System-Level power analysis, which supports chip-level decisions and predictions. This information supports peripheral, thermal-design-related decisions for future designs of the block. It also lays down the scope for maintenance, optimization, and enhancement of the design block and acts as a starting point for these efforts. These are some of the results that the power analysis flow aims to collect and visualize for every ASIC IP block under analysis. The goal of the flow is to make block characterization an outcome of the IP design flow.

Figure 4.3: Power split-up relationship with respect to incremental loading of block
[Plot: Power as a factor of 5 port dynamic power vs Number of clients access. X-axis: No. of ports access (5 to 250); Y-axis: Power as a factor of Dynamic Power for 5 port access (0 to 5); series: Switching Power, Internal Power, Total Dynamic Power, Leakage Power, Average Total Power.]

4.2 Power analysis results for optimization

The second objective of the power analysis, improving the design for power, is addressed using the power tools, analysis metrics, and the power analysis flow developed in this Thesis. The results are based on the Spyglass analysis discussed in section 3.2.2.

As a summary of section 3.2.2, Figure 3.11 depicts the optimization flow developed in the thesis using Spyglass. It involves iterative analysis towards achieving certain power metric goals. Modifications to the RTL design are made based on the problems identified against the metric goals described in the flow. An analysis of this type generates a detailed worksheet with issues identified down to the register level.

One of the results of this analysis is a ‘Power optimization worksheet’ that addresses each hierarchical block in the Common Memory Controller (CMC), looks at its metrics, namely static clock gating efficiency (SCGE), dynamic clock gating efficiency (DCGE), and Register output density for Enables (ROADE), and points to two major opportunities for dynamic power improvement:

• Insufficient Clock gating in design – Points to scope for enabling more instantiated Integrated Clock Gating Cells in the design.

• Inefficient Clock gating in design – Points to scope for improving the CG enable where further activity reduction is feasible by disabling design regions for more clock cycles, thereby approaching the optimum switching for the designed functionality.
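The two pointers can be derived mechanically from per-block metrics. The sketch below is one possible reading, not the Spyglass worksheet itself: it assumes that a low SCGE signals insufficient gating and that an adequate SCGE with a low DCGE signals inefficient gating. The goal values, block names, and metric numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BlockMetrics:
    name: str
    scge: float  # static clock gating efficiency in percent
    dcge: float  # dynamic clock gating efficiency in percent


# Hypothetical metric goals; real goals are project-specific.
SCGE_GOAL = 90.0
DCGE_GOAL = 70.0


def pointer(block: BlockMetrics) -> Optional[str]:
    """Return the optimization pointer for a block, or None if it meets both goals."""
    if block.scge < SCGE_GOAL:
        return "insufficient clock gating"  # too few registers sit behind clock gates
    if block.dcge < DCGE_GOAL:
        return "inefficient clock gating"  # gates exist but their enables stay on too long
    return None


def worksheet(blocks: List[BlockMetrics]) -> List[Tuple[str, str]]:
    """List (block name, pointer) for every block that misses a goal."""
    rows = []
    for block in blocks:
        found = pointer(block)
        if found is not None:
            rows.append((block.name, found))
    return rows


# Hypothetical hierarchy entries (names and numbers are illustrative only).
BLOCKS = [
    BlockMetrics("block_a", scge=62.0, dcge=40.0),
    BlockMetrics("block_b", scge=95.0, dcge=55.0),
    BlockMetrics("block_c", scge=97.0, dcge=82.0),
]

rows = worksheet(BLOCKS)
```

A block that misses the SCGE goal is reported only as insufficiently gated here, since its DCGE cannot be judged fairly until more registers are gated; this ordering is a design choice of the sketch.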

The CMC was studied based on these metrics, and a power optimization worksheet was developed that highlights, for designers, the possible focus areas for power savings. It lists the hierarchy-wise metrics and pointers to the two optimization problems above. Each problematic hierarchical element is examined in detail until register-level problems are identified. The design team is then called in to evaluate these findings and to decide whether to take up or waive the identified issues based on the criticality and feasibility of the optimizations. Such a discussion was held with the CMC design team at Ericsson.

Once optimization pointers are collected, the resulting modifications to the design are validated using the signoff power analysis tool, PrimeTimePX. Section 3.4 discusses one such opportunity for optimizing dynamic power, in the ACK FIFOs of the cache interface blocks. The power inefficiency pointers were identified for the cache block in CMC, and the RTL design of the cache block was modified to introduce better clock gating enable conditions, as discussed in that section. Power analysis is then performed as a validation step. In this section, the power measured on the netlist built from the modified RTL is compared with the power measured on the original netlist for the cache block in CMC. A summary of the results is presented in Table 2. The analysis focusses on the Cache interface block, which contains the ACK FIFOs that were modified for this validation effort (discussed in section 3.4).


Comparison Metric | Metric value after RTL change w.r.t. original RTL
Average CGE (%) | Same
Number of Clock gates | 29% lower
Net Switching Power | 13% lower
Cell Internal Power | 8% lower
Total Dynamic Power | 10% lower
Total Power | 8% lower
Active area (um2) | 2% lower

Table 2: Metric comparison between the modified and original RTL at Cache level

The table above lists the power and clock gating metrics used to compare the design modified for power with the original design, as a validation step. It shows the percentage difference of each metric in the modified design relative to the original design. These metrics are captured at the cache level, where the cache block consists of multiple instances of the ACK FIFO module on which the modification was performed.

Although the major goal of the modification was to enhance the clock gating efficiency of the Cache interface block by improving the ACK FIFO designs, the results show that the clock gating efficiency is similar before and after the optimization effort. This can be interpreted as follows. The RTL-introduced clock gating achieves no improvement in clock gating efficiency compared with the netlist synthesized without an explicit clock gating enable at RTL. After some exploration, it is understood that the lack of difference arises because the synthesis tool infers possible gating scenarios even though clock gating is not explicitly defined in the initial RTL, and implements clock gating in the design with a self-generated enable condition. Although the activity analysis tool VCD2RPT++ identifies the problem in the RTL and visualizes it as in Figure 3.13, the synthesis tool instantiates the clock gates on its own at the synthesis stage. This explains the similar clock gating efficiency in both versions of the RTL (pre- and post-optimization).

Table 3 presents the metrics at the ACK_FIFO level for the modified RTL relative to the original. The comparison serves as a validation step for the RTL modification, and the results show a clear improvement in the power metrics presented.

Comparison Metric | Metric value after RTL change w.r.t. original RTL
Internal Power | 46% lower
Total Power | 36% lower
Number of Clock gates per FIFO | 62% lower

Table 3: Metric comparison between modified and original RTL for ACK_FIFOs

From Table 2 and Table 3 we understand that, although the clock gating efficiency remains unchanged between the two analyses, there is around a 30% reduction in the number of clock gates used in the modified design. The net switching power of the cache is reduced by around 13% when the RTL modification is implemented, and the internal power of the cells at the cache level by about 8%. The RTL change yields around a 10% reduction in the total dynamic power of the cache and around an 8% reduction in the total power consumed by the cache block. The area of the cache block estimated by the synthesis tool is 2% lower for the netlist created with the RTL change. These are the effects on the cache subsystem of the RTL change to the ACK FIFOs inside the cache.

At the ACK FIFO level, where the modification was made, the internal power consumption of the ACK FIFOs is reduced by around 46% (Table 3), which is a significant improvement. The total power consumed by the ACK FIFOs is reduced by around 36%, which validates the optimization effort. The results also show that the number of clock gates in the ACK FIFO has decreased by around 62%, even though the clock gating efficiency remains the same. This suggests that the explicit clock gating modification makes the clock gating logic significantly more efficient and lets the synthesis tool avoid placing unnecessary clock gates across the design. These improvements are significant considering that they accumulate over the multiple instantiations of the FIFOs in the cache block.

To summarize, although the clock gating efficiency is similar in the two versions, the implementation of the gating cells and enable conditions differs between the two approaches. The analysis shows that the modified implementation requires fewer clock gates, consumes lower dynamic and total power, and occupies less area. This suggests that the clock gates introduced manually through the modified RTL are more favorable for implementation efficiency, power, and area. The modified RTL approach was therefore validated and chosen, even though the clock gating efficiency did not show the large improvement that was initially expected.
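The comparison and the resulting decision can be sketched as a percentage-difference helper plus a simple acceptance rule. The absolute values below are hypothetical and were chosen only so that the percentages match the cache-level rows of Table 2; the rule in `accept_rtl_change` is an assumption for illustration, not the decision procedure used with the design team.

```python
from typing import Dict, Iterable


def percent_lower(original: float, modified: float) -> float:
    """Reduction of `modified` relative to `original`, in percent."""
    if original <= 0:
        raise ValueError("original value must be positive")
    return 100.0 * (original - modified) / original


def accept_rtl_change(original: Dict[str, float], modified: Dict[str, float],
                      metrics: Iterable[str]) -> bool:
    """Accept the change if none of the listed metrics gets worse (higher)."""
    return all(modified[m] <= original[m] for m in metrics)


# Hypothetical absolute values (arbitrary units), not measured CMC data.
ORIGINAL = {"total_power": 50.0, "dynamic_power": 40.0, "area": 1000.0, "clock_gates": 100.0}
MODIFIED = {"total_power": 46.0, "dynamic_power": 36.0, "area": 980.0, "clock_gates": 71.0}

savings = {name: percent_lower(ORIGINAL[name], MODIFIED[name]) for name in ORIGINAL}
accepted = accept_rtl_change(ORIGINAL, MODIFIED, ("total_power", "dynamic_power", "area"))
```

Clock gating efficiency is deliberately left out of the acceptance rule in this sketch, mirroring the observation that the change was worthwhile even though that metric did not move.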

Looking at the whole analysis of the cache block in CMC using the power analysis flow in the Thesis, we can draw some conclusions for improving the power performance of the design. The modification notably improves several aspects of the design and reduces power consumption by significant percentages. Such an analysis using the power analysis flow, and the resulting optimization of ASIC IP Blocks, can lead to significant power savings.

Figure 4.4: Power bug location in a block using Power analysis Flow


The Common Memory Controller (CMC) is an ASIC IP block made up of millions of lines of code. The methodologies and techniques that constitute the power analysis flow formulated in this Thesis help identify the parts/lines of the code that cause power bugs and localize problem zones, as depicted in Figure 4.4. This ability of the flow to “find the needle in a haystack” is highly advantageous. The cache subsystem example in section 3.4 identified an issue in CMC and localized the problem to the cache subsystem. This narrowed scope improves the designer’s ability to identify and solve the issue. It was demonstrated that manually introducing a clock gating enable in the RTL of the ACK FIFO in the cache reduces the dynamic power of the cache subsystem by around 10%. The flow helped narrow the search down to this level to achieve the savings.

The design modifications are validated using netlist power analysis, which provides the savings numbers. This is an example of the type of analysis that is important in the later stages of the power analysis flow developed in the Thesis. As a validation of the optimizations, this step helps finalize the design changes before they are baselined for release.


5 Conclusion and Future

FinFETs, today’s leading-edge transistors, are so far delivering on the promise of scalability, performance, and power, but the road ahead is bumpy and filled with a slew of technical and cost challenges. The free power reduction that came with each new technology node is no longer available, and 7 nm is a relatively long-lived node. Designers can no longer rely on this automatic power scaling. This necessitates a power analysis flow such as the one developed in this Thesis, which helps characterize the ASIC IP Block in terms of its power consumption and efficiency metrics and supports dynamic power optimization, the major focus area of power improvement in FinFET-based designs.

This Thesis fulfills the goal of realizing a custom-tailored power analysis and reduction flow for a memory controller subsystem at Ericsson that connects the existing verification environment to the low-level power analysis tools. This is enabled by extracting power metrics using in-house and commercial front-end power tools, leading to a top-level and sub-block-level analysis that facilitates quick power exploration and profiling. The eventual goal of the Thesis is optimization, by pinpointing the largest and the redundant power consumers in the subsystem under the right workload scenarios. These pointers are shortlisted into a list of prime candidates and communicated to the design team, and one case is taken up to realize RTL changes and demonstrate achievable power and area savings.

The Thesis starts with understanding the concepts of power dissipation in ASICs and formalizing them using metrics. Then the Common Memory Controller (CMC), the ASIC IP Block under analysis, is studied and its usage scenarios are characterized. Next, the verification framework is understood in enough depth to generate knobs and tweak them to create power test cases. With the understanding that power analysis is only relevant for an ASIC IP Block in relation to its use case, power analysis is performed by generating power-related use cases for the two purposes: characterizing the block and uncovering power bugs in it. The understanding of the CMC, the test cases, and the power metrics is then used to methodically develop a power analysis and reduction framework at different stages of the design using different tools. The Thesis focusses on two stages of power analysis. The first is the early analysis at the RTL level of the design, which maximizes power savings and impact on the design by lowering the chance of bugs being found at later stages, where they are costlier to fix. The second takes place at the netlist level, where the focus is on the accuracy of the measurement relative to real consumption and on validating design modifications. At every stage of the Thesis, the architects and designers have been kept in the loop on the findings to ensure that the Thesis improves the knowledge and quality of the design in terms of power. The whole analysis is performed using the designs, verification environments, and tool features and licenses at Ericsson. All parts of the Thesis involved learning these environments and facilities and using them correctly.

Although the power analysis and reduction flow developed in the Thesis focusses on a specific design block at Ericsson and a specific set of tools, it is intended to generalize to other ASIC IP Blocks and other tools within or outside the organization. The focus is on the implementation methodology, the metrics that support the flow, and the physics behind the reduction of dynamic power. Although the results, in terms of metrics and power numbers specific to the block, form an important part of validating the flow, the goal is to present the power analysis and reduction flow as a plug-in iterative analysis step that fits into the IP design flow at ASIC design centers such as companies and universities. Such a flow yields power and area savings that accumulate into large cost and environmental benefits as ASICs, SoCs, and other forms of application-specific hardware become ubiquitous with the emergence of the Internet of Things, edge computing, and other innovative ideas. This thesis aims to motivate custom-tailoring a similar flow for any ASIC IP Block using any available toolkit.


A comprehensive power analysis and reduction methodology, such as the one implemented in this

thesis, is a helpful tool for understanding and improving the power behavior of an IP block from the

early steps of the design. As future work, it would be highly beneficial to be able to predict power and

energy, and to optimize them, at an algorithmic level. This is where the adoption of a structured

design flow, as advocated by Silago [22], becomes an interesting approach. With such a flow, discretely built-up

power and energy prediction models become more reliable and accurate. This leads to interesting

avenues for enabling predictable IP designs with high tape-out power estimation accuracy from an

early algorithmic level.


References

[1] J. D. Meindl, ‘A history of low power electronics: how it began and where it’s headed’, in Proceedings of 1997 International Symposium on Low Power Electronics and Design, 1997, pp. 149–151. DOI: 10.1109/LPE.1997.621267

[2] J. D. Meindl, ‘Low power microelectronics: retrospect and prospect’, Proc. IEEE, vol. 83, no. 4, pp. 619–635, Apr. 1995. DOI: 10.1109/5.371970

[3] Michael Keating, Ed., Low power methodology manual: for system-on-chip design. New York, NY: Springer, 2007, ISBN: 978-0-387-71818-7.

[4] ‘Semiconductor Engineering - The 7nm Pileup’. [Online]. Available: https://semiengineering.com/the-7nm-pileup/. [Accessed: 08-Sep-2019]

[5] Anantha P Chandrakasan and Robert W Brodersen, Low Power Digital CMOS Design. Boston, MA: Springer US, 1995, ISBN: 978-1-4615-2325-3 [Online]. Available: https://public.ebookcentral.proquest.com/choice/publicfullrecord.aspx?p=3080665. [Accessed: 08-Sep-2019]

[6] ‘A Review Paper on CMOS, SOI and FinFET Technology’. [Online]. Available: https://www.design-reuse.com/articles/41330/cmos-soi-finfet-technology-review-paper.html. [Accessed: 08-Sep-2019]

[7] ‘Semiconductor Engineering - Designing For Ultra-Low-Power IoT Devices’. [Online]. Available: https://semiengineering.com/designing-for-ultra-low-power-iot-devices/. [Accessed: 08-Sep-2019]

[8] ‘Spyglass Power: Complete Solution for Power Optimizations At RTL | Register Form’. [Online]. Available: https://www.synopsys.com/cgi-bin/verification/dsdla/pdfr1.cgi?file=spyglass-power-ds.pdf. [Accessed: 08-Sep-2019]

[9] ‘Static Timing Analysis - PrimeTime’. [Online]. Available: https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html. [Accessed: 08-Sep-2019]

[10] ‘Semiconductor Engineering - Why Chips Die’. [Online]. Available: https://semiengineering.com/why-chips-die/. [Accessed: 08-Sep-2019]

[11] Sanjay Churiwala and Sapan Garga, Principles of VLSI RTL design: a practical guide. New York: Springer, 2011, ISBN: 978-1-4419-9295-6.

[12] Amara Amara, Frédéric Amiel, and Thomas Ea, ‘FPGA vs. ASIC for low power applications’, Microelectron. J., vol. 37, no. 8, pp. 669–677, Aug. 2006. DOI: 10.1016/j.mejo.2005.11.003

[13] H. J. M. Veendrick, ‘Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits’, IEEE J. Solid-State Circuits, vol. 19, no. 4, pp. 468–473, Aug. 1984. DOI: 10.1109/JSSC.1984.1052168

[14] Ioannis Savvidis, Power Savings in MPSoC. 2009 [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-147365. [Accessed: 14-Oct-2019]

[15] ‘Power analysis of clock gating at RTL’, Design And Reuse. [Online]. Available: https://www.design-reuse.com/articles/23701/power-analysis-clock-gating-rtl.html. [Accessed: 20-Sep-2019]

[16] Himanshu Bhatnagar, Ed., ‘Asic Design Methodology’, in Advanced ASIC Chip Synthesis Using Synopsys® Design Compiler™ Physical Compiler™ and PrimeTime®, Boston, MA: Springer US, 2002, pp. 1–17 [Online]. DOI: 10.1007/0-306-47507-3_1

[17] ‘Differential Energy Analysis to Optimize Mobile GPU Power’, p. 6.

[18] ‘WebPlotDigitizer - Extract data from plots, images, and maps’. [Online]. Available: https://automeris.io/WebPlotDigitizer/index.html. [Accessed: 25-Sep-2019]

[19] Spyglass Power User guide from Synopsys, Internal document.

[20] PrimeTimePX user guide from Synopsys, Internal document.

[21] Dump, convert and replay: A Targeted Methodology to mitigating gate-level power simulations effort, Ioannis Savvidis, Ericsson AB, Presented at DAC 2019. Internal Document.

[22] S. M. A. H. Jafri, N. Farahini, and A. Hemani, ‘Silago-cog: Coarse-grained grid-based design for near tape-out power estimation accuracy at high level’, in 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2017, pp. 25–31.


TRITA-EECS-EX-2020:458

www.kth.se