THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
Accurate Leakage-Conscious
Architecture-Level Power Estimation
for SRAM-based Memory Structures
MINH QUANG DO
Division of Computer Engineering
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Göteborg, Sweden 2007
Accurate Leakage-Conscious Architecture-Level Power Estimation for SRAM-based
Memory Structures
Minh Quang Do
ISBN 978-91-7291-968-6
Copyright © Minh Quang Do, 2007.
Doktorsavhandlingar vid Chalmers tekniska högskola
Ny serie Nr 2649
ISSN 0346-718X
Technical report 31D
Department of Computer Science and Engineering
Embedded and Networked Processor Research Group
Division of Computer Engineering
Chalmers University of Technology
SE-412 96 GÖTEBORG, Sweden
Phone: +46 (0)31-772 10 00
Author e-mail: [email protected]
Printed by Chalmers Reproservice
GÖTEBORG, Sweden 2007
Accurate Leakage-Conscious
Architecture-Level Power Estimation
for SRAM-based Memory Structures
Minh Quang Do
Division of Computer Engineering, Chalmers University of Technology
ABSTRACT
Following Moore's Law, technology scaling will continue to provide integration
capacity of billions of transistors for the IC industry. As transistors keep
shrinking in size, leakage power dissipation increases dramatically and gradually
becomes a first-class design constraint. To provide higher performance at lower
power and energy for micro-architectures, on-chip caches are growing in size
and thus become a major contributor to the total leakage power dissipation in
next-generation processors. In these circumstances, accurate leakage power
estimation is clearly needed to allow designers to strike a balance between
dynamic power and leakage power, and between total power and delay in on-chip
caches.
This dissertation presents a modular, hybrid power modeling methodology
capable of accurately capturing both dynamic and leakage power mechanisms
for on-chip caches and SRAM arrays. The methodology successfully combines
the most valuable advantage of circuit-level power estimation – high accuracy –
with the flexibility of higher-level power estimation, while allowing for
short component characterization and estimation times. The methodology offers
high-level, parameterizable, yet accurate power dissipation estimation models
that consist of analytical equations for dynamic power and pre-characterized
leakage power values stored in tables.
In addition, a modeling methodology that captures the dependence of leakage
power on temperature variation, supply-voltage scaling, and the selection of
process corners is also presented. This methodology provides an essential
extension to the proposed power models.
Keywords: VLSI, CMOS, Deep Submicron, Power Estimation, Cache Power Modeling,
SRAM Power Modeling, Power-Performance Estimation Tool, DSP Architecture
Preface
This Ph.D. thesis presents the results of my research work conducted during the
period January 2002 to May 2007. It is based on the following seven papers:
⊲ Paper 1: M. Q. Do, P. Larsson-Edefors and L. Bengtsson,
“Table-based Total Power Consumption Estimation of Memory Ar-
rays for Architects,” in Proceedings of the 14th International Work-
shop on Power and Timing Modeling, Optimization and Simulation
(PATMOS), Isle of Santorini, Greece, Sept. 15–17, 2004, pp. 869–
878.
⊲ Paper 2: M. Q. Do, M. Draždžiulis, P. Larsson-Edefors and
L. Bengtsson, “Parameterizable Architecture-level SRAM Power
Model Using Circuit-simulation Backend for Leakage Calibration,”
in Proceedings of International Symposium on Quality Electronic
Design (ISQED), San Jose, CA, USA, March 27-29, 2006, pp. 557–
563.
⊲ Paper 3: M. Q. Do, M. Draždžiulis, P. Larsson-Edefors and
L. Bengtsson, “Leakage-Conscious Architecture-Level Power Es-
timation for Partitioned and Power-Gated SRAM Arrays,” in Pro-
ceedings of International Symposium on Quality Electronic Design
(ISQED), San Jose, CA, USA, March 26-28, 2007, pp. 185–191,
(best-paper-award nominee).
⊲ Paper 4: M. Q. Do, P. Larsson-Edefors and M. Draždžiulis,
“Capturing Process-Voltage-Temperature (PVT) Variations in Architectural Static Power Modeling for SRAM Arrays,” Technical
Report No. 2007-06, Department of Computer Science & Engi-
neering, School of Computer Science and Engineering, Chalmers
University of Technology, Göteborg, Sweden, May 2007.
⊲ Paper 5: M. Q. Do, P. Larsson-Edefors and L. Bengtsson,
“Leakage-Conscious Architecture-Level Power Estimation Models
for On-Chip Caches,” manuscript.
⊲ Paper 6: M. Q. Do, P. Larsson-Edefors and M. Draždžiulis,
“Current Probing Methodology for Static Power Extraction in Sub-
90nm CMOS Circuits,” Technical Report No. 2007-07, Depart-
ment of Computer Science & Engineering, School of Computer
Science and Engineering, Chalmers University of Technology, Göte-
borg, Sweden, May 2007.
⊲ Paper 7: M. Q. Do, P. Larsson-Edefors and M. Draždžiulis,
“High-Accuracy Architecture-Level Power Estimation for Parti-
tioned SRAM Arrays in a 65-nm CMOS BPTM Process,” to ap-
pear in Proceedings of 10th Euromicro Conference on Digital Sys-
tem Design, Architecture, Methods and Tools (DSD 2007), Lübeck,
Germany, August 27–31, 2007, (invited paper).
The following related papers are not included in this thesis:
⊲ Paper 8: M. Q. Do, L. Bengtsson and P. Larsson-Edefors,
“DSP-PP: A Simulator/Estimator of Power Consumption and Per-
formance for Parallel DSP Architectures,” in Proceedings of the
21st Multiconference in Applied Informatics - Parallel and Dis-
tributed Computing and Networks Symposium (PDCN), Innsbruck,
Austria, Feb. 10–13, 2003, pp. 767–772
⊲ Paper 9: M. Q. Do, L. Bengtsson and P. Larsson-Edefors,
“Models for Power Consumption Estimation in the DSP-PP Sim-
ulator,” in Proceedings of the 1st International Signal Processing
Conference (ISPC), Dallas, Texas, USA, March 31–April 3, 2003.
⊲ Paper 10: M. Q. Do and L. Bengtsson, “Analytical Models for
Power Consumption Estimation in the DSP-PP Simulator: Prob-
lems and Solutions,” Technical Report No. 03-22, Department of
Computer Engineering, School of Computer Science and Engineer-
ing, Chalmers University of Technology, Göteborg, Sweden, Au-
gust 2003.
⊲ Paper 11: M. Q. Do, P. Larsson-Edefors and L. Bengtsson,
“Table-based Total Power Consumption Estimation Approach for
Architects,” in Proceedings of the Swedish System-on-Chip Con-
ference, Båstad, Sweden, April 13–14, 2004.
⊲ Paper 12: M. Q. Do, P. Larsson-Edefors and L. Bengtsson,
“Towards a Power and Performance Simulation Framework for Par-
allel DSP Architecture,” in Poster abstracts of 1st International
Summer School on Advanced Computer Architecture and Compi-
lation for Embedded Systems, L’Aquila, Italy, July 24-30, 2005,
pp. 161–164.
⊲ Paper 13: M. Q. Do, M. Draždžiulis, and P. Larsson-Edefors,
“Architecture-Level Power Estimation and Scaling Trends for
SRAM Arrays,” in Proceedings of the Swedish System-on-Chip
Conference, Kolmården, Sweden, May 4–5, 2006.
Acknowledgments
My life has been characterized by “adventures”, both long and short. This dis-
sertation marks the end of an unforgettable adventure that started 5.5 years ago
when I was admitted to the Doctoral Program at the Department of Computer
Science and Engineering, Chalmers University of Technology. Although full of
toil and sweat, it has never been a lonely journey for me, since I was blessed to
have wonderful people as my companions.
First and foremost, I owe my greatest gratitude to two of the wisest men I am
fortunate to work with: my research supervisor, Associate Professor Lars Bengtsson,
and my research examiner, Professor Per Larsson-Edefors.
I am very grateful to Lars Bengtsson for accepting me as his PhD student, letting
me do the research my way, and constantly backing me, encouraging me and giving me
invaluable advice, especially in the first half of my PhD studies.
I am also indebted to Per Larsson-Edefors for his enthusiasm, constant support,
encouragement and excellent professional advice. I have always been
impressed and inspired by his sense of professionalism. For me, Per Larsson-
Edefors is not only a research examiner but also an advisor who can work in-
tensively together with his students very late at night, who always has
some new ideas to add and knows several ways to realize them. And who is still in
love with “hard rock” music at his age today ...
I would like to thank Docent Lars “J” Svensson for being a member of my ad-
vising committee and for sharing his research experience and profound competence
with me, especially on know-how and know-where questions and research issues.
I want to thank Dr. Daniel Eckerbert for his comments and critical research dis-
cussions on power consumption estimation methodologies and their classification,
as well as for his kind help with the HSPICE and Cadence design tools.
Many thanks go to Firas Milh, a master thesis worker at the Department of
Computer Science and Engineering, for implementing the DSP-PP simulator
used in this thesis.
I have also met many other interesting people along the way who deserve special
thanks from me:
⊲ All recent and former members of the VLSI Research Group: Daniel
Eckerbert, Henrik Eriksson, Mindaugas Draždžiulis, Dainius Ciuplys,
Daniel Andersson, Magnus Själander; thank you, my friends, for accept-
ing me as an “unofficial” group member, and for sharing not only your
research experiences, but also your interests in music, fashion, games,
entertainment, etc., thus making my life here not just work and work!
⊲ Thank you, Martin Thuresson, for your kind help and critical discus-
sions on research-related topics like C programming, LaTeX, Emacs, UNIX,
etc.! I have very much enjoyed your warm friendship, your hospitality and your
sharing with me your knowledge of Swedish culture, language, history
and society.
⊲ A special thanks goes to Mindaugas Draždžiulis and Egle Reimontaite
for their warm friendship to me and to my family. An extra thanks is
given to Mindaugas for his help and cooperation in doing research.
⊲ An extra thanks goes to Magnus Själander and Martin Thuresson for
their help in proofreading this dissertation.
⊲ A special thanks goes to Jochen Hollmann, Djordje Jeremic, Xiao Ming,
Wolfgang John, Raul Barbosa, M. Waliullah, Mafijul Islam, former Ph.D.
students (Dr. Fredrik Warg, Dr. Zihuai Lin, Dr. Dhammika Bokolamulla,
Dr. Håkan Forsberg, Dr. Kristina Forsberg, Dr. Jim Nilsson, Lic. Eng.
Peter “biff”) and the other Ph.D. students at the Department of Computer
Science and Engineering for their collegiality, which creates a friendly
and inspiring atmosphere to work in at our department.
⊲ I would like to thank Per Waborg for guiding me through university bu-
reaucracy, for being a good, “unbeatable” table-tennis opponent,
and for sharing his life experience with me.
⊲ Many thanks to the rest of the colleagues, administrative staff and tech-
nical support at the Department of Computer Science and Engineering
for creating a nice working environment and providing me with assistance
in many possible ways.
⊲ I send a lot of thanks to my Vietnamese friends at CTH for their friend-
ship and help, without which my study here would have been much more diffi-
cult and less enjoyable.
Last, but most deserved, I am grateful to my parents, who have always
trusted me and encouraged me to reach the heights I wish for; to my brothers and
sister for their understanding and encouragement; and especially to my beloved
wife, Tran Thi Thu Ha, and my little angel, Ngoc “Candy”, for their incredible
patience, unflagging support, great encouragement and endless love,
particularly at very difficult moments in my adventures. This work is, therefore,
dedicated to them.
Minh Quang Do
Göteborg, May 2007
Contents
Abstract i
Preface iii
Acknowledgments vii
I Introduction 1
1 Introduction 3
1.1 Technology Scaling and its Induced Problems . . . . . . . . . . 3
1.2 On-Chip Cache – Trend of Development . . . . . . . . . . . . . 7
1.3 On-Chip Cache – Leakage Power Estimation . . . . . . . . . . 8
1.4 Dissertation Objective and Scope . . . . . . . . . . . . . . . . . 11
1.4.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . 13
1.6 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . 16
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
II Background 19
2 On-Chip SRAM Cache Architecture 21
2.1 Caches for DSP and Embedded Systems . . . . . . . . . . . . . 21
2.1.1 Basic DSP Architectures . . . . . . . . . . . . . . . . . 22
2.1.2 Cache Architectures in DSP and Embedded Systems . . 25
2.2 Caches for GPP Systems . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Cache System Architecture . . . . . . . . . . . . . . . . 28
2.3 Cache on GPPs and DSPs: Differences . . . . . . . . . . . . . . 32
2.4 Cache Organization . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.1 Basic Cache Organization . . . . . . . . . . . . . . . . 33
2.4.2 Memory Partitioning . . . . . . . . . . . . . . . . . . . 38
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Power Dissipation in CMOS 43
3.1 Mechanisms of Power Dissipation . . . . . . . . . . . . . . . . 44
3.1.1 Dynamic Power . . . . . . . . . . . . . . . . . . . . . . 45
3.1.2 Leakage Power . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Trend of Development and Emerging Issues . . . . . . . . . . . 50
3.3 Leakage Power Reduction Techniques . . . . . . . . . . . . . . 53
3.3.1 Power Cut-off Techniques . . . . . . . . . . . . . . . . 54
3.3.2 Leakage-Reduction Techniques for SRAM-based Caches 56
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Cache Power Modeling – Tool Perspective 63
4.1 A Survey of Existing Power-Performance Tools . . . . . . . . . 64
4.2 High-Level Power Estimation Tools for Caches . . . . . . . . . 68
4.3 Power Dissipation Estimation Models . . . . . . . . . . . . . . 69
4.3.1 High-level Power Estimation Methodology . . . . . . . 69
4.3.2 Analytical Models . . . . . . . . . . . . . . . . . . . . 70
4.3.3 Table-based and Equation-based Models . . . . . . . . . 75
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
III Power Modeling for SRAM-based Structures 81
5 Modular Approach to Power Modeling 83
5.1 Analytical Power Modeling Approach & Problems . . . . . . . 84
5.2 The Proposed Modular Modeling Approach . . . . . . . . . . . 86
5.3 Probing Methodology for Leakage . . . . . . . . . . . . . . . . 89
5.4 Power Models for On-Chip Caches . . . . . . . . . . . . . . . . 93
5.4.1 Power Models for Partitioned Data SRAM Arrays . . . 93
5.4.2 Power Models for Unpartitioned Data SRAM Arrays . . 106
5.4.3 Power Models for SRAM-based Tag Arrays . . . . . . . 107
5.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.5.1 Validation Methodology . . . . . . . . . . . . . . . . . 112
5.5.2 Validation of Power Models for Data SRAM Arrays . . 115
5.5.3 Validation of Power Models for SRAM-based
Tag Arrays . . . . . . . . . . . . . . . . . . . . . . . . 121
5.6 Thermal and Variability Issues . . . . . . . . . . . . . . . . . . 124
5.6.1 Modeling the Dependence of Leakage on Temperature . 124
5.6.2 Modeling Leakage with Variation in Supply Voltage . . 127
5.6.3 Modeling the Dependence of Leakage on Process
Corner . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6 Conclusion and Future Work 133
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
IV Appendix 137
A DSP-PP Simulator 139
A.1 Characteristics of DSP Architectures . . . . . . . . . . . . . . . 140
A.2 DSP-PP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
A.2.1 Features of the DSP-PP . . . . . . . . . . . . . . . . . . 142
A.2.2 Description of the DSP-PP Simulator (Version 2.0) . . . 144
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
List of Figures
1.1 The original Moore's Law. (Source: Intel Museum [2]) . . . . . 4
1.2 Moore’s Law as illustrated by the transistor count per IC for
Intel microprocessors from the 4004 to the Itanium 2 (9 MBytes
cache). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Dynamic and leakage power trend as predicted by ITRS (from
[7]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 The die photo of a) Intel’s Madison Processor (374 mm2). b)
Intel’s Pentium M Processor (84 mm2). (Source: Intel Press-
room [2]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Total leakage power as function of min L and tox for a 6T-
SRAM cell (BPTM 32-nm @ Vdd = 1.1 V) . . . . . . . . . . 10
2.1 Basic DSP Architectures: Harvard architecture . . . . . . . . . 22
2.2 Basic DSP Architectures: a) with a MUX, and b) with a MUX
and a small instruction cache . . . . . . . . . . . . . . . . . . 23
2.3 The most frequently used DSP Architecture . . . . . . . . . . 24
2.4 A VLIW-DSP Architecture with Multiple Datapaths . . . . . . 24
2.5 A Two-level Cache Architecture (TI TMS320C6211/TI C6x
DSP [5]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 A typical memory hierarchy in GPPs . . . . . . . . . . . . . . 29
2.7 A two-level cache architecture used in GPPs . . . . . . . . . . 30
2.8 A three-level cache architecture used in GPPs . . . . . . . . . 30
2.9 Block diagram of the Intel Itanium 2 processor [14] . . . . . . 31
2.10 Basic organization of a Direct-mapped cache [7] . . . . . . . . 35
2.11 Basic organization of a typical SRAM-based cache . . . . . . . 36
3.1 Leakage mechanisms in an off-state NMOS transistor with
VG = VS = 0 and VD = Vdd . . . . . . . . . . . . . . . . . . . 44
3.2 EOT and gate leakge density scaling for extended planar bulk
CMOS devices (ITRS 2006) . . . . . . . . . . . . . . . . . . . 52
3.3 Scaling in subthreshold leakage for extended planar bulk CMOS
devices (ITRS 2006) . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Gate length scaling for extended planar bulk CMOS devices
(ITRS 2006) . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Leakage current paths in the SCCMOS technique (from [12]) . 54
3.6 Leakage current paths in the ZSCCMOS technique (from [12]) 55
3.7 Leakage current paths in the GSCMOS technique (from [12]) . 56
5.1 Subthreshold leakage power with different temperature for an
NMOS transistor (commercial 130-nm process) . . . . . . . . 85
5.2 Power modeling methodology: a) Component Characteriza-
tion Phase, and b) Power Estimation Phase . . . . . . . . . . . 87
5.3 Current measurement for MOS transistors used in Hspice sim-
ulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Block diagram of a partitioned SRAM array using DWL and
DBL techniques . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Organization of a sub-array . . . . . . . . . . . . . . . . . . . 96
5.6 (a) Characterization of a 6T-SRAM cell, (b) Hspice configura-
tion for VLBL estimation . . . . . . . . . . . . . . . . . . . . 98
5.7 Subthreshold (green, solid) and gate leakage (red, dotted) cur-
rents in a partitioned 6T-SRAM cell . . . . . . . . . . . . . . . 100
5.8 Characterization of (a) a sense amplifier, (b) a write circuit . . 101
5.9 Architecture of a 8-256 row decoder . . . . . . . . . . . . . . 104
5.10 The structure of a typical Ntag-bits NOR-based comparator . . 109
5.11 Total power dissipation of 8-KB data arrays [blue/grey — 8A,
brown/black — 8B, yellow/white — 8C] . . . . . . . . . . . . 115
5.12 Total power dissipation of 2-KB data arrays [blue/grey — 2A,
brown/black — 2B] . . . . . . . . . . . . . . . . . . . . . . . 116
5.13 Accuracy in estimating: a) dynamic power, b) leakage power,
c) total power for 8-KB data arrays [blue/grey—8A, brown/black—
8B, yellow/white—8C] . . . . . . . . . . . . . . . . . . . . . 118
5.14 Accuracy in estimating: a) dynamic power, b) leakage power,
c) total power for 2-KB data arrays [blue/grey—2A, brown/black—
2B] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.15 The proportion of dynamic (in brown/black) and leakage (in
blue/grey) power in the 8A, 8B and 8C arrays. . . . . . . . . . 120
5.16 The proportion of dynamic (in yellow) and leakage (in orange)
power in the 2A array. The proportion of dynamic (in blue)
and leakage (in brown) power in the 2B array. . . . . . . . . . 120
5.17 Accuracy in estimating: a) dynamic power, b) leakage power
for a 2-KB SRAM-based tag array . . . . . . . . . . . . . . . 122
5.18 Total power dissipation of a 2-KB partitioned SRAM-based
tag array (blue/grey) and a 2-KB partitioned data array (brown/black) . . 123
5.19 Subthreshold leakage power as a function of temperature for a
6T-SRAM cell (commercial 130-nm) . . . . . . . . . . . . . . 126
5.20 Gate and subthreshold leakage power as functions of Vdd for a
6T-SRAM cell (BPTM 65-nm [20]) . . . . . . . . . . . . . . . 128
5.21 The subthreshold leakage power’s dependence on temperature
for a 6T-SRAM cell (commercial 130-nm with process cor-
ners: SS, TT, FF) . . . . . . . . . . . . . . . . . . . . . . . . 129
A.1 Block Diagram of the DSP Power Performance Simulator . . . 143
A.2 Interconnection of components inside a SP of the extended
ManArray architecture [3] . . . . . . . . . . . . . . . . . . . . 147
A.3 Interconnection of components inside a PE of the extended
ManArray architecture [3] . . . . . . . . . . . . . . . . . . . . 148
A.4 The GUI of our implemented DSP-PP simulator . . . . . . . . 151
List of Tables
4.1 The equations for capacitance of critical nodes . . . . . . . . . 73
5.1 Organization parameters for partitioned SRAM arrays . . . . . 95
5.2 Organization parameters for partitioned SRAM-based tag arrays . . 108
Part I
Introduction
1 Introduction
1.1 Technology Scaling and its Induced Problems
In the beginning, complementary metal-oxide semiconductor (CMOS) technol-
ogy was chosen because it dissipated much less power than earlier technologies
such as transistor-transistor logic (TTL) and emitter-coupled logic (ECL). This
was indeed true at the time: when not switching, MOS transistors dissipate
negligible power at clock frequencies in the kHz range. However, as device
switching frequency and chip integration density keep increasing, power dissipation
increases dramatically. Observing the trend of CMOS device integration, Gordon
Moore of Intel in 1965 gave his most famous prediction, often referred to as
Moore's Law: the number of devices on an IC would double every 12 months
(later revised to every 24 months) [1]. This observation, after several revisions,
still largely holds true; it has served, and continues to serve, as a driving force
for CMOS technology, the silicon industry, and the personal computer (PC)
manufacturing industry. Figure 1.1 shows the original graph drawn by Gordon
Moore when he published his observation in 1965, whereas Figure 1.2 shows
the development of the transistor count for Intel microprocessors from the 4004
to the Itanium 2 (the version with a 9-MByte cache) as an illustration of
Moore's Law at work in real life.
Figure 1.1: The original Moore's Law. (Source: Intel Museum [2])
Along with technology scaling, many new design challenges have emerged;
performance and power dissipation are two major issues of computer system
design. Among these, the latter has been recognized by the processor design
community as a first-class architectural design constraint, not only for portable
computers and mobile communication devices, but also for high-end systems,
e.g., superscalar, single-processor, multiprocessor, multi-core and high-performance
embedded processor systems [3].
Figure 1.2: Moore’s Law as illustrated by the transistor count per IC for Intel micro-
processors from the 4004 to the Itanium 2 (9 MBytes cache).
While low power is important, achieving the lowest-power solution alone
is obviously not the primary goal. First and foremost, the system design must
meet the performance and feature requirements of the application. Success ulti-
mately lies in the ability to strike the optimum balance between performance,
power and cost. To achieve these design goals, there is a need to develop
power-performance estimation tools that help designers model entire sys-
tems as well as every system component, and perform power-performance
evaluation and tradeoff analysis.
Moreover, as a result of CMOS technology scaling, leakage power dissipa-
tion has become a significant portion of the total power consumption in deep-
submicron VLSI chips [4]. The International Technology Roadmap for Semi-
conductors (ITRS [5]) predicts that in a few years, the total leakage power of a
chip may exceed the total dynamic power, and the projected increase in sub-
threshold leakage (Figure 1.3) shows that it will exceed total dynamic power
dissipation as technology drops below the 65-nm feature size [6]. As leakage con-
tinues to increase in importance, accurate leakage power estimation is needed
to allow designers to make good design trade-offs. This is especially true at
higher design levels, which are associated with a higher degree of design freedom
and thus potentially higher power savings.
Figure 1.3: Dynamic and leakage power trend as predicted by ITRS (from [7])
1.2 On-Chip Cache – Trend of Development
Although leakage power dissipation is an issue for all processor circuit compo-
nents, it is a particularly important problem in on-chip caches, which have large
sections that are idle for relatively long periods of time. This is due to three rea-
sons [8]: (i) sub-threshold leakage current increases with technology scaling;
(ii) leakage energy increases with the effective number of transistors in the cir-
cuit; and (iii) a large transistor budget is allocated to on-chip caches in current
processors.
Figure 1.4: The die photo of a) Intel’s Madison Processor (374 mm2). b) Intel’s Pen-
tium M Processor (84 mm2). (Source: Intel Pressroom [2])
In recent years, in order to minimize latency and improve memory
bandwidth, larger L1, L2, and even L3 caches are being integrated on die,
thanks to the integration capacity offered by recent submicron, DSM and
VDSM CMOS processes. For example, the Alpha 21464 processor has a
128-KByte L1 and a 1.5-MByte L2 cache, Intel's Madison processor has a 1-MByte L2
and a 6-MByte L3 cache, Intel's Pentium M (Centrino) processor has a 2-MByte L2 cache, and
the Dual-Core Multi-Threaded Xeon processor has 2-MByte L2 and 16-MByte
L3 on-chip caches. Figure 1.4 shows die photos of Intel's Madi-
son and Pentium M processors. In these processors, on-chip
caches occupy more than 50% of the die area. This trend eventually makes on-
chip caches one of the major contributors to the total leakage
power dissipation of microprocessors.
1.3 On-Chip Cache – Leakage Power Estimation
As leakage continues to increase in importance, accurate leakage power
estimation is needed to allow designers to strike a balance between dynamic
power and leakage power, and between total power and delay in on-chip
caches.
Since all leakage mechanisms are closely related to the physical behavior
of MOS transistors, the type of circuitry involved, and the process technology
parameters, a straightforward way to model them is to use equations and sets of
parameters that describe these complex behaviors of MOS transistors. This mod-
eling approach is referred to as the analytical approach. The complexity of the equa-
tions determines the accuracy of the estimated leakage power. BSIM4 models leakage
mechanisms using very detailed and complex equations [9]; for example, BSIM4
models the sub-threshold leakage current of a MOS transistor using the following
equations (refer to [9] for more details):
$$I_{sub} = I_0 \left(1 - e^{-V_{ds}/V_{th}}\right) e^{-\left(V_T - V_{off}\right)/\left(n V_{th}\right)} \qquad (1.1)$$

where

$$V_T = V_{TH0} + \delta_{NP} \bigl( \Delta V_{T,\mathrm{body\_effect}} - \Delta V_{T,\mathrm{charge\_sharing}} - \Delta V_{T,\mathrm{DIBL}} + \Delta V_{T,\mathrm{reverse\_short\_channel}} + \Delta V_{T,\mathrm{narrow\_width}} + \Delta V_{T,\mathrm{small\_size}} - \Delta V_{T,\mathrm{pocket\_implant}} \bigr)$$

$$I_0 = \mu \frac{W}{L} V_{th}^2 \sqrt{\frac{q\,\epsilon_{si}\,N_{DEP}}{2\phi_s}}; \qquad V_{th} = \frac{k_B T}{q} \qquad (1.2)$$
Here, q is the elementary charge, T is the temperature, n is the sub-
threshold swing coefficient, kB is the Boltzmann constant, NDEP is the chan-
nel doping concentration, φs is the surface potential, εsi is the dielectric con-
stant of silicon, µ is the carrier mobility, Vth is the thermal voltage, Vds is the
drain-source voltage, Voff is the offset voltage, W is the transistor width, and L is
the transistor length. VT is the device threshold voltage, defined by a very complex
expression: VTH0 is the threshold voltage of a long-channel device at zero bias, and
∆VT,body_effect, ∆VT,charge_sharing, ∆VT,DIBL, ∆VT,reverse_short_channel,
∆VT,narrow_width, ∆VT,small_size and ∆VT,pocket_implant are the body-effect, charge-
sharing, DIBL, reverse-short-channel, narrow-width, small-size, and pocket-
implant contributions to VT, respectively. δNP is defined as +1 for NMOS and −1 for
PMOS.
Eqs. 1.1–1.2 show how complicated it is to calculate the value of the sub-
threshold leakage current for a MOS transistor analytically. Thus, although
BSIM4 models offer high accuracy in estimating leakage power, they are ob-
viously not suitable for higher-level power estimation due to their complex re-
lations and equations, which require the user to have deep knowledge of device
models and access to detailed process parameters.
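As a concrete illustration of the temperature and threshold-voltage sensitivity expressed by Eqs. 1.1–1.2, the simplified sub-threshold current can be evaluated numerically. The sketch below is illustrative only: the parameter values (mobility, doping, surface potential, offset voltage) are assumed placeholders rather than values from any real BSIM4 model card, and the many ∆VT corrections of Eq. 1.2 are assumed to be folded into the single VT argument.

```python
import math

def subthreshold_current(W, L, V_T, V_ds, T=300.0, n=1.4, V_off=-0.08,
                         mu=0.04, N_DEP=1e24, eps_si=1.04e-10):
    """Evaluate the simplified sub-threshold current of Eqs. 1.1-1.2.

    V_T is the (pre-computed) threshold voltage; all Delta-V_T corrections
    of Eq. 1.2 are assumed folded into it. SI units throughout (m, V, K, A).
    All default parameter values are illustrative placeholders.
    """
    q = 1.602e-19          # elementary charge [C]
    k_B = 1.381e-23        # Boltzmann constant [J/K]
    V_th = k_B * T / q     # thermal voltage (Eq. 1.2)
    phi_s = 0.8            # assumed surface potential [V]

    # I0 of Eq. 1.2
    I0 = mu * (W / L) * V_th**2 * math.sqrt(q * eps_si * N_DEP / (2.0 * phi_s))
    # Eq. 1.1
    return I0 * (1.0 - math.exp(-V_ds / V_th)) \
              * math.exp(-(V_T - V_off) / (n * V_th))
```

Evaluating this at two temperatures shows the exponential growth of leakage with T (through the thermal voltage), one of the effects the table-based models in this thesis are characterized against.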
During the past decade, a fair amount of research effort has been directed
towards developing high-level power-performance tools for on-chip caches. To
avoid the complexity of an analytical approach, there have been efforts to simplify
the BSIM3 or BSIM4 analytical equations to a degree of complexity accept-
able for use in higher-level power estimation tools. However, these simpli-
fied analytical leakage power models still suffer from serious drawbacks in estimat-
ing leakage power: inaccuracy and inflexibility. One of the most widely used
power estimation tools in the public domain is CACTI [10], which offers analytical
timing and energy models for partitioned caches. In its previous versions 1.0,
2.0 and 3.2, CACTI used only ideal first-order scaling for technology trends.
Further, it did not include any leakage power models. The PRACTICS tool [11]
uses analytical models to determine an optimal design for partitioned caches
by performing an exhaustive comparison of alternative memory configuration
parameters. Although PRACTICS provides more accurate estimates of inter-
connect effects than CACTI 3.2, it still does not include power
models for leakage estimation.
The recently released CACTI version (4.0 [12]) is updated with respect
to basic circuit structures, to device parameters for improved technology
scaling, and to leakage models, in that a model based on HotLeakage [13] and
eCACTI [14] has been added. However, the added model still fails to accurately ac-
count for short-channel effects, for gate leakage, and for terminal voltage de-
pendencies in transistor stacks; the model error in estimating leakage power
dissipation is claimed to be 21.5% [14]. Moreover, as the concept of a tech-
nology node, according to ITRS'05 [5], gradually is abandoned, using a typical
process may yield large estimation errors for static-power dominated memories.
If leakage power models at the architectural level are to guide design trade-offs,
they need to be calibrated to one or several target processes.
[Figure 1.5 shows a surface plot of the total leakage current Ileak of a 6T-SRAM cell as a function of gate-oxide thickness tox and channel length L, annotated with the ratio Isleak/Igleak at the plot corners.]
Figure 1.5: Total leakage power as function of min L and tox for a 6T-SRAM cell (BPTM 32-nm @ Vdd = 1.1 V)
Sub-threshold leakage still remains the main contributor to total leakage; however, other mechanisms such as gate-oxide tunneling and junction (BTBT) leakage are of increasing significance. Predicting which leakage mechanism will dominate in the future is difficult, since there is a complex interaction between technology and circuit development. In a recent study on power dissipation for nanometer caches [15], a surprising trend for the sub-threshold and gate leakage power components was outlined: for example, for a 32-nm Berkeley Predictive Technology Model (BPTM) [16] process, the sub-threshold contribution dominated gate leakage by approximately 30× (Figure 6 in [15]). In Figure 1.5 the dependence of total leakage power, i.e. the sum of gate (Igleak) and sub-threshold (Isleak) leakage power, on minimum L and tox, respectively, is plotted for a 6T-SRAM cell in that particular 32-nm BPTM process [16]. Assuming the default minimum L for transistors suggested in a predictive process is clearly a poor design compromise. This example also serves to show how important it is for leakage estimation to capture not only what general technology is used, but also which circuit design context is used. Since the relative significance of leakage mechanisms varies with design context, only leakage estimation based on data calibrated to target libraries can be trusted.
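The qualitative behaviour around Figure 1.5 can be related to the textbook first-order forms of the two static currents (standard BSIM-style expressions, not the calibrated models used in this dissertation):

```latex
P_{leak} = V_{dd}\left(I_{sleak} + I_{gleak}\right), \qquad
I_{sleak} \;\propto\; \frac{W}{L}\, v_T^{2}\,
  e^{(V_{GS}-V_{th})/(n\,v_T)}\left(1 - e^{-V_{DS}/v_T}\right)
```

Here $v_T = kT/q$ is the thermal voltage and $n$ the sub-threshold slope factor. $V_{th}$ drops as $L$ shrinks (short-channel roll-off), boosting $I_{sleak}$ exponentially, while the gate tunneling current $I_{gleak}$ grows roughly exponentially as $t_{ox}$ is thinned; this is why both axes of Figure 1.5 matter.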
1.4 Dissertation Objective and Scope
1.4.1 Objective
With the motivations given in Sections 1.1 - 1.3, the objective of this dissertation is to solve, partially or completely, the following problems:
1. To provide designers with a modeling methodology to capture accurately
all leakage mechanisms and dynamic power dissipation for on-chip caches
and SRAM arrays. The methodology needs to exploit not only the most
valuable advantage of circuit-level power estimation – high accuracy,
but also the flexibility of higher-level power estimation. Moreover, the
methodology needs to be simple and generic enough so that designers
can use it to generate power models for their on-chip caches and SRAM
arrays of interest with different configurations and organizations.
2. To provide users with accurate, parameterizable power estimation models that have low complexity and require little computation time when estimating both leakage and dynamic power dissipation for on-chip caches and SRAM arrays. These power models need to be extensible and implementable in architecture-level power estimation tools.
3. To ensure compatibility between the proposed power models and the power models implemented in existing power simulation tools, enabling those models to be updated with better and more accurate ones.
4. To outline and design an architecture-level power dissipation estimation
tool for DSPs using the new cache and SRAM power models (i.e. the
“DSP-PP”).
1.4.2 Scope
In this dissertation, to limit the scope of the research, CMOS circuit technology has been assumed, and all on-chip caches are assumed to be implemented in deep-submicron (DSM) and very-deep-submicron (VDSM) CMOS processes.
Depending on where they are located in the memory hierarchy, on-chip caches have different organizations and organization parameters (e.g. cache size, block size, word size, associativity). For example, an L1 cache often has a small block size and a word size equal to the width of the data bus, whereas L2/L3 caches tend to have larger block and data-word sizes, and smaller associativity. The organization of on-chip caches also depends on what type of applications and systems they are used for, i.e. the application domain. Caches for DSP and embedded systems are organized differently from caches for GPP microprocessor systems. So, there are more than a few options to choose from when selecting the cache organization for use in this dissertation. To serve as a basic
platform for modeling power dissipation, direct-mapped L1 caches with small size (i.e. 2 KBytes - 8 KBytes), small block size (i.e. 4) and small data-word length (i.e. 4 Bytes) have been selected. The reason for this selection is that once the power models for the selected caches are successfully obtained, there are no fundamental obstacles to extending them to L1/L2 caches with higher associativity, bigger size and bigger data-word length; in other words, the obtained power models are fully extensible.
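For concreteness, the address breakdown implied by such a configuration can be sketched as follows. This is a hypothetical sketch, not from the thesis: the 4-word blocks, 4-byte words and 32-bit addresses are my assumptions for illustration, since the thesis only states a block size of 4.

```python
# Hypothetical sketch (not from the thesis): address breakdown of a
# direct-mapped cache. The 4-word blocks, 4-byte words and 32-bit
# addresses are assumptions for illustration.
from math import log2

def cache_geometry(cache_bytes, block_words=4, word_bytes=4, addr_bits=32):
    block_bytes = block_words * word_bytes        # bytes per cache line
    n_lines = cache_bytes // block_bytes          # direct-mapped: sets == lines
    offset_bits = int(log2(block_bytes))          # select byte within a line
    index_bits = int(log2(n_lines))               # select the line
    tag_bits = addr_bits - index_bits - offset_bits
    return n_lines, offset_bits, index_bits, tag_bits

# A 4-KB direct-mapped cache with 16-byte lines:
print(cache_geometry(4 * 1024))   # (256, 4, 8, 20)
```

For the 2-KB and 8-KB ends of the studied range, the same sketch gives 128 and 512 lines, respectively.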
To further limit the scope of the research, on-chip SRAM-based caches con-
sisting of tag 6T-SRAM-based arrays and data 6T-SRAM-based arrays with
regular structures have been assumed. Both tag and data arrays are physically
partitioned into sub-arrays using divided-bit-line (DBL) and divided-word-line
(DWL) techniques within a memory bank. Partitioning a cache into banks is done at a higher level than the physical partitioning, and it is normally applied to highly-associative caches. According to the discussion given in [12], in practice most users expect multiported multi-bank caches to first synthesize dependent ports from independent banks, and only to multi-port the banks themselves if the required number of ports exceeds the number of banks. Thus, for simplicity, memory banks with single read/write ports have been assumed in this dissertation.
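The DBL/DWL partitioning described above can be sketched numerically as follows. This is a minimal sketch under my own naming: the CACTI-style parameters ndwl and ndbl (number of word-line and bit-line divisions) are not the thesis's notation.

```python
# Illustrative sketch of DWL/DBL physical partitioning of one bank.
# ndwl cuts the word lines (columns), ndbl cuts the bit lines (rows);
# the parameter names follow CACTI conventions, not the thesis.
def partition(rows, cols, ndwl, ndbl):
    """Return (sub-array count, rows per sub-array, cols per sub-array)."""
    assert rows % ndbl == 0 and cols % ndwl == 0
    return ndwl * ndbl, rows // ndbl, cols // ndwl

# A 2-KB array laid out as 128 x 128 cells, cut once in each direction:
print(partition(128, 128, ndwl=2, ndbl=2))   # (4, 64, 64)
```

The shorter word and bit lines in each 64 x 64 sub-array reduce the capacitance switched per access, which is the point of the technique.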
1.5 Dissertation Contributions
The main contributions of this dissertation are:
1. To propose a modular hybrid power estimation modeling methodology
for on-chip caches and SRAM arrays. The proposed modeling methodol-
ogy is capable of capturing accurately both dynamic and leakage power
mechanisms for on-chip caches and SRAM arrays. Also, the proposed
modeling methodology is simple and straightforward, allowing for short
component characterization and estimation time. Rather than using only
one technique to estimate power dissipation, the proposed methodology
seeks to find the best match between a particular estimation technique
and a specific cache component. For example, a probabilistic approach
has been used to estimate both dynamic and static power of address de-
coders, an analytical approach has been used to estimate dynamic power
of bitlines and 6T-SRAM cells, sense amplifiers, write circuits, and word-
line drivers, while a circuit-simulation-based modeling backend has been
used to estimate all leakage power mechanisms. Furthermore, the proposed modeling methodology is modular; thus, it can be applied to model power dissipation for other types of components with regular structures, e.g. content-addressable memory (CAM).
The initial idea of the modeling methodology is discussed in Paper 1, where the White-box Table-based Total Power Consumption (WTTPC) estimation approach is introduced. Further development of the WTTPC approach led to the modeling methodology for unpartitioned data SRAM arrays that is fully described in Paper 2. The modeling methodology for physically partitioned data SRAM arrays is developed and described in Paper 3. Finally, the modeling methodology for on-chip caches is described in detail in Paper 5.
2. To offer high-level parameterizable, but still accurate power dissipation
estimation models for on-chip caches and SRAM arrays. For each cache
component, its power model for total power estimation consists of ana-
lytical equations for dynamic power and pre-characterized leakage power
values. Different cache components are characterized by performing a few simple circuit-level DC simulations using appropriate probes,
to extract the leakage power from simulation data. Dynamic analytical
power models are derived based on the well-known activity-based switch-
ing power equation, with nodal capacitances extracted using a circuit-
level simulator that establishes the operating point and DC capacitances.
The total leakage power accounts for all types of leakage currents that
are present in the transistor models used by circuit simulators, during
both idle and active cycles. Therefore, the proposed power models offer
much better accuracy and flexibility in estimating both total and leakage power dissipation for on-chip caches and SRAM arrays compared to the high-level analytical power models implemented in existing power estimation tools.
In Paper 2, the component characterization for leakage power and capacitance extraction for all components of an unpartitioned data SRAM-based array is described in detail, and the power models are clearly explained and presented. In Paper 3, power models for the components of a physically partitioned data SRAM-based array are explained and presented. Power models for on-chip cache components, including tag and data SRAM-based arrays, are described in Paper 5.
3. To provide verification of the proposed power estimation models for a number of on-chip cache configurations implemented in 0.13-µm and 65-nm CMOS processes. The validation results for on-chip caches are shown in Paper 5, whereas the validation results for unpartitioned and partitioned data SRAM-based arrays are given in Paper 2, and in Papers 3 and 7, respectively. The accuracy obtained in these validations is high (above 95%) compared to the power values of circuit-level simulations.
4. To propose a modeling methodology to capture the dependence of leak-
age power on temperature variation, on supply-voltage scaling, and on the
selection of process corners for accurate architectural-level power estima-
tion of on-chip caches. The modeling methodology extends the power models obtained earlier for cache components to capture the dependence of leakage power on variability issues. The proposed modeling methodology and power models are described in Paper 4.
5. To separate all leakage mechanisms existing in on-chip caches and ensure the ability to capture them correctly using an appropriate probing strategy and a circuit-level simulator. Initially, a description of the major leakage mechanisms for components of a 6T-SRAM-based data array and the probing strategy to capture them is given in Paper 2. Later, a methodology for probing circuits for static current measurements in CMOS circuits during simulation was proposed. The methodology is capable of capturing all leakage mechanisms existing in BSIM4 models, in this case implemented in the Hspice simulator. A full description of the methodology is given in Paper 6. The proposed probing methodology was used successfully to obtain accurate and distinguishable static power constituents (i.e. gate, subthreshold and total leakage power) for 2-kB unpartitioned and partitioned SRAM memory arrays implemented in a BPTM 65-nm process (Paper 7).
6. To create a framework and a design for implementing a cycle-accurate
architecture-level performance-power estimation tool for parallel DSP ar-
chitectures (DSP-PP). The structure and design of the DSP-PP simulator are described in Paper 1.
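The dynamic part of these component models builds on the well-known activity-based switching power equation; in textbook form (my notation, with the nodal capacitances $C_i$ extracted as described above):

```latex
P_{dyn} \;=\; \sum_{i} \alpha_i \, C_i \, V_{dd}^{2} \, f
```

where $\alpha_i$ is the switching activity of node $i$ and $f$ the clock frequency.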
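The pre-characterized leakage values, extended with temperature and supply-voltage dependence as in Contribution 4, lend themselves to a table-with-interpolation implementation. The following is a hypothetical sketch: the grid values are invented for illustration, whereas a real table would come from circuit-level characterization runs.

```python
# Hypothetical sketch of a table-based leakage back-end: leakage is
# pre-characterized by circuit simulation at a few (temperature, Vdd)
# points and interpolated in between. Grid values are made up.
def bilerp(x, y, x0, x1, y0, y1, q00, q01, q10, q11):
    """Bilinear interpolation of q at (x, y) on the cell [x0,x1] x [y0,y1]."""
    tx = (x - x0) / (x1 - x0)
    ty = (y - y0) / (y1 - y0)
    return (q00 * (1 - tx) * (1 - ty) + q10 * tx * (1 - ty)
            + q01 * (1 - tx) * ty + q11 * tx * ty)

# Invented leakage power (uW) of one component at four corners:
#            Vdd=1.0   Vdd=1.2
# T=25 C      10.0      16.0
# T=85 C      40.0      70.0
p = bilerp(55.0, 1.1, 25.0, 85.0, 1.0, 1.2, 10.0, 16.0, 40.0, 70.0)
print(round(p, 2))   # 34.0
```

A sparse grid keeps characterization time short while letting the architecture-level tool query arbitrary operating points.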
1.6 Dissertation Overview
The remainder of this dissertation is organized as follows. Chapters 2 - 4 provide the reader with background and theory. Chapter 2 focuses on on-chip SRAM-based cache architectures used in DSP, embedded, and GPP systems. Chapter 3 reveals the mechanisms behind the power dissipation of MOS transistors, and provides several useful techniques to combat power dissipation. Chapter 4 provides information about the power estimation models implemented in existing power estimation tools for on-chip caches and other processor components. This chapter also presents some background information on power modeling in general, its classification and its areas of application.
Chapter 5 accounts for the work done on the modeling methodology for on-chip caches and SRAM arrays. First, a discussion shows in more detail the drawbacks of an analytical approach to power modeling, and the reason why the table-based, simulation-based power modeling approach has been selected. After that, our modular hybrid power estimation modeling methodology for on-chip caches and SRAM data arrays is described in detail. The following section is dedicated to validation of the obtained power models against circuit-level simulations for complete on-chip caches and data SRAM arrays. After this section, the modeling methodology to capture the dependence of leakage power on temperature variation, on supply-voltage scaling, and on the selection of process corners is presented and discussed in detail.
Appendix A presents the work done on the design and implementation of a cycle-accurate architecture-level performance-power estimation tool for parallel DSP architectures (DSP-PP) as a case study. This serves as an example of an application where our proposed power modeling methodology and power models for on-chip caches and data SRAM arrays can be implemented. The dissertation ends with conclusions and some ideas for future work on modeling power dissipation for other types of components, e.g. CAM and clocking networks.
Bibliography
[1] G. E. Moore, “Cramming More Components onto Integrated Circuits,” Electron-
ics, vol. 38, no. 8, Apr. 1965.
[2] http://www.intel.com, 2007.
[3] T. Mudge, "Power: A First Class Design Constraint," Computer, vol. 34, no. 4, pp. 52–58, Apr. 2001.
[4] S. Borkar, “Design Challenges for Technology Scaling,” IEEE Micro, vol. 19, no.
4, pp. 23–29, Aug. 1999.
[5] International Technology Roadmap for Semiconductors, http://public.itrs.net,
ITRS, 2006.
[6] B. Doyle et al., “Transistor Elements for 30-nm Physical Gate Lengths and Be-
yond,” Intel Technology Journal, vol. 6, pp. 42–54, May 2002.
[7] Nam Sung Kim, K. Flautner, D. Blaauw, and T. Mudge, "Circuit and Microarchitectural Techniques for Reducing Cache Leakage Power," IEEE Transactions on VLSI Systems, vol. 12, no. 2, pp. 167–184, Feb. 2004.
[8] L. Li et al., “Leakage Energy Management in Cache Hierarchies,” in Proceedings
of PACT’02, Sept. 2002, pp. 131–140.
[9] Univ. California Berkeley Device Group, BSIM4.2.1 MOSFET Model: User’s
Manual, Dept. of EECS, Univ. of California, Berkeley, CA 94720, USA, 2002.
[10] S.J.E. Wilton et al., WRL 93/5: An Enhanced Access and Cycle Time Model for
On-chip Caches, WRL, 1994.
[11] A. Y. Zeng et al., "Cache Array Architecture Optimization at Deep Submicron Technologies," in ICCD 2004, Oct. 2004, pp. 320–325.
[12] D. Tarjan et al., HPL 2006-86: CACTI4.0, HP, 2006.
[13] Y. Zhang et al., CS 2003-05: HotLeakage : A Temperature-Aware Model of Sub-
threshold and Gate Leakage for Architects, Dept. of CS, Univ. of Virginia, USA,
2003.
[14] M. Mamidipaka et al., CECS 04-28: eCACTI: An Enhanced Power Estimation
Model for On-chip Caches, CECS, Univ. of California, Irvine, USA, 2004.
[15] S. Rodriguez et al., "Energy/Power Breakdown of Pipelined Nanometer Caches (90nm/65nm/45nm/32nm)," in ISLPED 2006, Oct. 2006, pp. 25–30.
[16] W. Zhao et al., "New Generation of Predictive Technology Model for Sub-45nm Design Exploration," in ISQED 2006, Mar. 2006, pp. 585–590.
Part II
Background
2 On-Chip SRAM Cache Architecture
In this chapter, background information on on-chip SRAM-based cache architectures is provided. Section 2.1 presents the on-chip cache architectures used in Digital-Signal-Processing (DSP) and embedded systems, whereas Section 2.2 focuses on the on-chip cache architectures used in General-Purpose-Processor (GPP) systems.
2.1 Cache Architecture Used in DSP and Embedded Computer Systems
This section starts by presenting some very basic DSP architectures, given in Figs 2.1, 2.2, 2.3 and 2.4. It then shows how caches were introduced and integrated into some of those basic DSP architectures, creating high-performance DSP processors capable of meeting the rapidly increasing demands posed by high-performance DSP applications.
2.1.1 Basic DSP Architectures
In the classical von Neumann architecture the arithmetic-logic unit (ALU) and
the control unit (CU) are connected to a single memory that stores both the data
values and program instructions. This architecture is very simple and it was
used when memory was very expensive to build. The main drawback of this
architecture is the bottleneck of the memory system.
[Figure 2.1 shows the Harvard architecture: a Program Memory and a Data Memory connected by separate address and data/instruction buses to the Datapath and the Instruction Processor, respectively.]
Figure 2.1: Basic DSP Architectures: Harvard architecture
Fig. 2.1 shows the classical Harvard architecture. It is an improved architecture compared to the von Neumann one. Two separate memories are used to store data (i.e. Data Memory - DM) and program (i.e. Program Memory - PM), and two separate buses are used to connect the data and program memories to the Datapath (DP) and to the Instruction Processor (IP), respectively. This simple architecture is still used in many micro-controllers, but it is not used in any recent DSPs [1].
Since the most common operation in digital signal processing is convolution, which is implemented by repeated multiply and add steps, a DSP processor must be able to efficiently perform multiply-and-accumulate operations, e.g. by using Multiply-And-aCcumulate (MAC) units. Ideally, each multiply-and-accumulate operation should be performed in a single instruction cycle, which requires at least two values to be read from and one value to be written to the data memory, while two or more address registers must be updated. Thus, it is obvious that high memory bandwidth is just as important as a fast multiply-and-accumulate operation [2].
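As a concrete illustration of the MAC requirement (my own minimal example, not from the thesis): one output sample of an N-tap FIR convolution is N multiply-accumulate steps, each needing a coefficient read, a data read and an accumulator update.

```python
# One output sample of an N-tap FIR filter: N MAC operations,
# each reading one coefficient and one data value.
def fir_sample(coeffs, window):
    acc = 0
    for c, x in zip(coeffs, window):
        acc += c * x      # one multiply-and-accumulate per tap
    return acc

print(fir_sample([1, 2, 3], [4, 5, 6]))   # 4 + 10 + 18 = 32
```

Sustaining one MAC per cycle thus demands two memory reads per cycle, which is exactly the bandwidth pressure discussed above.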
[Figure 2.2 shows two Harvard-derived architectures in which a MUX lets the PM also serve as a data memory alongside the DM; variant (b) adds a small instruction cache in front of the IP.]
Figure 2.2: Basic DSP Architectures: a) with a MUX, and b) with a MUX and a small instruction cache
Fig. 2.2 shows two other DSP architectures that were gradually improved from the Harvard one to provide multiple accesses to the DM. In these architectures, the program memory can also be used as a coefficient (data) memory when executing a convolution. A multiplexer (MUX) is used to provide accesses to the DM and PM when needed. In the architecture shown in Fig. 2.2b, a small cache is added to store a short program. This cache is used when the PM is needed for data access, supporting non-overlapping hardware loops containing multiple instructions. However, these architectures require dual- or multiple-ported program memory, thus raising their design cost. Besides, the clock rate is limited by the memory access rate and therefore cannot be very high.
Fig. 2.3 shows the most frequently used simple DSP architecture with a sin-
gle DP. Two data memories are used to support convolution and vector-based
algorithms.
[Figure 2.3 shows a single-datapath architecture with a PM and two DMs multiplexed onto the DP and IP.]
Figure 2.3: The most frequently used DSP Architecture
[Figure 2.4 shows a VLIW-DSP with an IP issuing multiple instructions, several DPs and DMs connected through a switch network for multiplexing, a DMA unit, a PM, and main memory.]
Figure 2.4: A VLIW-DSP Architecture with Multiple Datapaths
Fig. 2.4 shows a typical VLIW architecture, an example of a DSP architecture with multiple datapaths. The VLIW-DSP architecture allows multiple instructions to be fetched and executed in parallel. These instructions are decoded in the IP, and control signals are then supplied to the multiple datapaths. Parallel execution of multiple arithmetic operations in the DPs requires multiple DMs to store coefficients and results. A VLIW DSP typically assumes that data dependencies are known and therefore manages them at compile time [1].
2.1.2 Cache Architectures in DSP and Embedded Systems
Traditionally, DSP system architectures do not have any caches [3]. Instead,
they rely on multiple banks of fast on-chip addressable SRAM memories and
multiple bus sets to allow for several memory accesses per instruction cycle
(Figs 2.1, 2.2a, and 2.3). The on-chip addressable SRAMs are designed to be
accessible by both the central processing unit (CPU) and the direct memory ac-
cess unit (DMA) [4]. However, caches are increasingly used in DSPs for storing
instructions and data required by large, high-performance and memory-hungry
DSP applications. In the beginning, a small specialized instruction cache was
incorporated in some DSP processors to store instructions of small loops so
that the on-chip bus sets can be free to retrieve data (Fig. 2.2b). Later, on-chip
multi-level caches were commonly used on some general purpose DSP fami-
lies, e.g. the Texas Instruments (TI) TMS320C6211 and TI C6x DSP [5]. The
main reasons to have caches in DSP architectures are:
• High-performance DSP applications increasingly require processing capability from DSP processors, which in turn imposes harsh demands for increased operating frequency and bandwidth on the memory system.
• The frequency of the on-chip SRAM memories traditionally used in DSPs does not scale along with the DSP clock rate, and as a result only relatively small memory sizes are able to meet the frequency goals. This is in direct contrast to the increasing program-size requirements of DSP applications, which require ever larger on-chip SRAMs.
• Advanced process technologies have allowed both the CPU speed to increase and more memory to be integrated on-chip, but the access time of on-chip memory has not improved proportionally. Therefore, the memory often becomes a processing bottleneck. Besides, large on-chip SRAM memory is also expensive to build.
• Caching offers a hardware-managed and user-transparent view of a large address space in a physically small, local SRAM and narrows the performance gap between processor and main memory. Therefore, the introduction of multi-level cache systems to DSP architectures can greatly reduce the CPU-to-memory processing bottleneck while still maintaining the DSP goals of low cost and low power.
[Figure 2.5 shows the TI two-level cache: L1 I-cache and D-cache SRAMs with their controllers, a combined L2 cache/L2 SRAM controller managing an L2 cache region (0 - 256 Kb) and an L2 SRAM region (0 - 8 Mb), DMA logic, and an external DMA interface, connected by 64-, 256- and 512-bit buses.]
Figure 2.5: A Two-level Cache Architecture (TI TMS320C6211/TI C6x DSP [5])
In a multi-level cache system, the level nearest the DSP (level 1) is optimized for the high DSP core clock rate and low access latency. The size of this level 1 (L1) cache may be constrained by the core clock rate. At the same time, the outer levels can be optimized for storage density and power. Often the outer cache levels have multi-cycle access times. The penalty for a miss in an inner cache level that hits in an outer level is normally a small integer number of clock cycles. Fig. 2.5 shows the two-level cache architecture used in the TI TMS320C6211 [4] and TI C6x DSP families [5]. In this cache architecture, the L1 memories consist of a small direct-mapped instruction cache (I-cache) and a small two-way set-associative data cache (D-cache), while the level 2 (L2) consists of a relatively larger on-chip unified SRAM memory that can be partially configured as a four-way set-associative cache.
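The cost of such a hierarchy can be illustrated with the standard average-memory-access-time model (a textbook sketch with invented cycle counts, not TI data):

```python
# Two-level average memory access time: an L1 miss that hits in L2
# adds only the L2 latency; an L2 miss pays the main-memory penalty.
def amat2(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty)

# e.g. 1-cycle L1, 5% L1 misses, 6-cycle L2, 20% L2 misses, 100-cycle DRAM:
print(round(amat2(1, 0.05, 6, 0.20, 100), 2))   # 2.3
```

The small integer miss penalty from L1 to L2 keeps the average close to the L1 latency as long as L2 absorbs most L1 misses.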
The TI TMS320C6211 uses separate 4-KB L1 I- and D-caches, and four 16-KB banks of on-chip SRAM memory that can individually be configured as either local memory or a unified L2 cache. In the TI C6x DSP families, the L1 cache consists of separate 16-KB I- and D-caches, while the L2 is a 1-MB memory that can be mapped as all SRAM or as a mix of cache (up to 256-KB) and SRAM [5]. There are motivations behind the selection of size and associativity for both the L1 and L2 caches:
• Since most DSP algorithms consist of small, tight loops that execute the
same code on multiple data locations, a direct-mapped cache is suitable
for the L1 I-cache. The size of an L1 I-cache should be large enough
to accommodate multiple DSP kernels simultaneously to ensure a small
number of cache misses [3].
• As mentioned earlier in Section 2.1.1, DSP processors must be able to
efficiently perform each multiply-and-accumulate (MAC) operation, ideally in a single instruction cycle, which requires at least two values to be
read from and one value be written to the data memory. A two-way set
associative cache is suitable for an L1 D-cache since it keeps both MAC
operands in the cache, allowing simultaneous accesses to both operands
without going to the L2 cache or main memory. The size of the L1 D-
cache should be large enough to keep data for several DSP kernels loaded
simultaneously in the L1 I-cache.
• The size of the L2 memory is designed to be as large as possible because
misses are much less likely to occur. The L2 memory can be config-
ured as a unified on-chip SRAM memory or as a cache entirely, or as a
combination of cache and SRAM. The associativity of the L2 cache is
determined by how many of the banks are configured as caches, allowing
1-, 2-, 3- or 4-way associativity.
An L2 memory can also be further optimized for a particular system by selecting appropriate parameters such as line size, allocation policies, replacement policies, pipelining, prefetching, and SRAM latency. The L2 memory interfaces to the DMA controller for cache accesses and DMA transfers.
Data coherency between external memory and the L1 caches is maintained.
The L2 can be programmed to access various memory sizes with various access
latencies and also to allow CPU initiated DMA transfers [5].
Caches present in DSPs are typically adapted to suit DSP needs. For ex-
ample, the DSP may allow the programmer to manually “lock” portions of the
cache, so that performance-critical sections of the software can be guaranteed
to be resident in the cache. This helps to provide easy execution time predic-
tions at the cost of reduced performance for other sections of software that may
need to be fetched from main memory. Normally, DSP vendors are responsible
for providing programmers with tools that enable accurate determination of program execution times. These tools are of great help to programmers when implementing and optimizing real-time DSP software, thus improving the performance of DSPs.
2.2 Cache Architectures in GPP Systems
2.2.1 Cache System Architecture
Unlike in DSP processors, on-chip caches have long been commonly used in general-purpose processors (GPPs). By definition, cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU [6]. Fig. 2.6 shows a typical memory hierarchy used in embedded, desktop and server computers [6]. A memory hierarchy takes advantage of temporal locality by keeping more recently accessed data items closer to the processor, and takes advantage of spatial locality by moving blocks consisting of multiple contiguous words in memory to the upper levels of the hierarchy. Fig. 2.6 also shows that the memory hierarchy uses smaller and faster
memory technologies close to the processor. Therefore, if the hit ratio is high
enough, the memory hierarchy has an effective access time close to that of the
highest (and fastest) level and a size equal to that of the largest (and slowest)
level [7].
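The claim above can be made concrete with a two-level sketch (illustrative numbers only, not from the thesis): with hit ratio h, the effective access time is a weighted average of the fast and slow levels, so a high h pulls it toward the fast level.

```python
# Effective access time of a two-level hierarchy with hit ratio h.
def t_eff(h, t_fast, t_slow):
    return h * t_fast + (1 - h) * t_slow

# 98% hits in a 1-cycle level backed by a 50-cycle level:
print(round(t_eff(0.98, 1, 50), 2))   # 1.98
```

With 98% hits, the hierarchy behaves almost like the 1-cycle level despite the 50-cycle backing store.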
[Figure 2.6 shows a CPU with registers and a cache, connected over a memory bus to main memory and over an I/O bus to I/O devices.]
Figure 2.6: A typical memory hierarchy in GPPs
Cache hit ratio and access time are two metrics that determine the perfor-
mance of a cache system. There have been numerous studies on techniques
for achieving fast cache access while maintaining high hit ratios which include
selecting the appropriate cache parameters such as cache size, line size, set
associativity, allocation policies, replacement policies, pipelining, prefetching,
and SRAM latency [8], [9], [10]. Those topics, however, are beyond the scope of this thesis work and will therefore not be summarized or studied in this dissertation.
Theoretically, a memory hierarchy can consist of N cache levels, where N is an integer: 1, 2, 3, etc. Depending on the performance and cost requirements for the cache system, the number of cache levels must be chosen to minimize the cache access time, as well as to maintain high cache hit ratios [8]. In practical designs, N ≤ 4 is seen in most cache hierarchies. As a rule of thumb, it is safe to say that the cache at the lowest level is usually small, fast and often located on-chip, while the one at the highest level is often large, unified, and may or may not be located on-chip. Fig. 2.7 shows a typical memory system used in GPPs where the cache system consists of two levels: L1 and L2. The on-chip level 1 cache is split into separate I-cache and D-cache to support the instruction and data fetch bandwidths of modern GPPs.
The L2 cache is an off-chip unified memory used to store both instructions and data [9]. The Translation Lookaside Buffer (TLB) is an on-chip cache that stores recently used translations from virtual page addresses to valid physical addresses. In Fig. 2.7, the size of the caches increases from level 1 (lower) to level 2 (higher), but the speed decreases. In other words, both the storage capacity and the latency of a cache increase while going from a lower to a higher cache level.
[Figure 2.7 shows a microprocessor containing the register file, split level 1 I- and D-caches and a TLB, with an off-chip unified level 2 cache between the CPU and the memory bus leading to main memory, plus an I/O bus to I/O devices.]
Figure 2.7: A two-level cache architecture used in GPPs
[Figure 2.8 shows the same microprocessor with on-chip level 1 caches, TLB and a unified level 2 cache, plus an off-chip level 3 cache on the memory bus to main memory.]
Figure 2.8: A three-level cache architecture used in GPPs
In some recent designs, the L1 and L2 caches are integrated on-chip, and there is no L3 cache located between the L2 cache and the main memory, e.g. the Intel Pentium 4, the Intel Pentium M, the Intel Xeon, the Intel Dual-core Pentium D [11], and the AMD Dual-core Opteron [12]. Instead of using an off-chip L3 cache, communication between an on-chip L2 cache and the main memory is usually done through a memory controller that can be located on-chip or off-chip. If the memory controller is integrated on-chip, the L2 cache is connected to it through a high-speed back-side bus (BSB); if it is located off-chip, the L2 cache is connected to it through a slower front-side bus (FSB).
In several other designs, there are three levels of caches: an on-chip L1, an on-chip L2 and an off-chip L3 cache (Fig. 2.8). For example, the IBM multi-core Power5 microprocessor has a separate on-chip L1 cache (consisting of a 64-KB two-way set-associative I-cache and a 32-KB four-way set-associative D-cache) for each core; an on-chip ten-way set-associative 1.875-MB L2 cache shared between the two cores; and an off-chip 36-MB L3 cache with an on-chip directory [13]. The L3 cache is connected directly to the L2 cache through a high-speed back-side bus, however, not via the on-chip memory controller.
[Figure 2.9 shows the Itanium 2 block diagram: an L1 I-cache with fetch/pre-fetch engine and ITLB, instruction queue (8 bundles), branch prediction, IA-32 decode and control, 11 issue ports, register stack engine/re-mapping, branch and predicate registers, 128 integer and 128 FP registers, scoreboard/predicate/NaTs/exceptions logic, branch units (x3), integer and MM units (x7), floating-point units (x2), a quad-port L1 D-cache with DTLB and ALAT, a quad-ported L2 cache, an L3 cache, and a 128-bit, 6.4-GB/s @ 400 MT/s system bus with ECC.]
Figure 2.9: Block diagram of the Intel Itanium 2 processor [14]
In addition, in order to reduce memory traffic in a multiprocessor configuration, Intel has other versions of the Pentium 4 with much larger on-chip caches; for example, the Intel Xeon MP processor comes with an on-chip L3 of 1 MB, 2 MB or 4 MB, and the Intel Pentium 4 Extreme Edition processor comes with an on-chip L3 of 2 MB [7]. Moreover, the Intel Itanium 2 processor (Fig. 2.9), a representative of Intel's IA-64 64-bit EPIC processor family, has an L1, an L2 and an L3 cache integrated on-chip. It has separate 16-KB four-way set-associative L1 D- and I-caches, a 256-KB unified eight-way set-associative L2 cache, and a large unified 24-way set-associative L3 cache of either 3 MB, 6 MB or 9 MB in size [14]. Even so, the Itanium 2 is not the Intel processor with the largest on-chip caches: the Intel Dual-core Itanium 2 processor has a unified 24-MB low-latency L3 cache, and its cache hierarchy totals nearly 27 MB for the entire processor [11].
The above-mentioned examples of Intel processors suggest several development trends for cache implementation in recent GPPs: (i) more cache levels are explored and implemented; (ii) larger and larger caches are integrated on-chip; (iii) more cores are integrated on a processor chip, which requires even larger on-chip caches and memories to provide the necessary instructions and application data for the cores to run threads in parallel, and to store the obtained results.
2.3 Cache on GPPs and DSPs: Differences
Although both DSP and GPP caches are implemented to bridge the performance
gap between processor and main memory by maintaining fast access time and
high hit ratios, there are still some differences:
1. More levels of on-chip caches are used in GPPs, often three, while a DSP cache system thus far normally consists of two levels. This difference in implemented cache levels may change in future products, but for the time being the cache systems used in DSPs remain one generation behind those used in GPPs [15].
2. In level 1, the cache is usually split into separate I- and D-caches in both DSPs and GPPs, but these caches are larger in GPPs. In DSPs, the I-cache is direct-mapped and the D-cache two-way set-associative, whereas in GPPs the I-cache is normally two-way set-associative and the D-cache multiple-way set-associative.
3. In level 2, both GPPs and DSPs use large unified caches; in DSPs, however, the level-2 memory can be configured entirely as unified on-chip SRAM, entirely as cache, or as a combination of cache and SRAM. This ability has not been seen in any GPP caches. In addition, the L2 caches in GPPs tend to have higher set-associativity than those used in DSPs.
4. Whereas the caches used in GPPs are generally neither visible to nor controlled by the application programmer, the cache systems in DSPs are both visible to and controlled by the application programmer [3]. Unlike GPPs, DSPs generally do not use dynamic features such as branch prediction and speculative execution. Predicting the execution time of a given section of code is therefore fairly easy on a DSP, which allows programmers to confidently push the DSP performance limits [15].
2.4 Cache Organization
2.4.1 Basic Cache Organization
This section briefly describes the organization of a typical SRAM-based cache, its working principles and its circuitry. Detailed descriptions of cache organizations are given in [7] and [16].
Caches are normally organized as two-dimensional arrays. The first dimen-
sion is the set, and the second dimension is set associativity. The set ID is
determined by a function of the address bits of the memory request. The line ID
within a set is determined by matching the address tags in the target set with the
referenced address. Caches with a set associativity of one are commonly referred to as direct-mapped caches, while caches with a set associativity greater than one are referred to as set-associative caches. If there is only one set, the cache is called fully-associative.
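The distinctions above can be made concrete with a short sketch (the helper names and example numbers are illustrative, not from the thesis):

```python
def num_sets(cache_blocks, associativity):
    """A cache with B blocks and associativity A has B/A sets:
    A = 1 is direct-mapped, A = B (a single set) is fully-associative."""
    return cache_blocks // associativity

def set_id(block_address, sets):
    """The usual set-mapping function: block address modulo the number of sets."""
    return block_address % sets

blocks = 256
print(num_sets(blocks, 1), set_id(1165, 256))   # direct-mapped: 256 sets, set 141
print(num_sets(blocks, 4), set_id(1165, 64))    # four-way: 64 sets, set 13
print(num_sets(blocks, blocks))                 # fully-associative: 1 set
```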
Inside a cache, each cache entry consists of data and a tag that identifies the main memory address of that data. To indicate whether a cache block holds valid information, a valid bit (V) is added to each cache entry: V = 1 indicates that the tag entry contains a valid memory address; otherwise, the tag entry is ignored and there cannot be a match for this block. A memory request hits in the cache when the upper bits of the referenced address and the tag are equal, in which case the data is supplied to the processor. Otherwise, a miss occurs.
Fig. 2.10 shows the basic organization of the 16-KB direct-mapped cache used in the Intrinsity FastMATH Adaptive Signal Processor [7], which contains 256 blocks with 16 words (i.e. 512 bits) per block. The byte-offset, block-offset and index fields are two, four and eight bits wide, respectively, and the tag field is 18 bits wide. The eight-bit index field defines the number of cache entries (i.e. 256), whereas the four-bit block-offset field is used to select a word from a block using a 16-to-1 multiplexor.
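The field widths quoted above follow directly from the cache geometry; a small sketch (with hypothetical helper names) reproduces them:

```python
import math

def cache_fields(cache_size_bytes, words_per_block, word_bytes=4, addr_bits=32):
    """Split a byte address into tag/index/block-offset/byte-offset widths
    for a direct-mapped cache with the geometry of Fig. 2.10."""
    byte_offset = int(math.log2(word_bytes))             # selects a byte in a word
    block_offset = int(math.log2(words_per_block))       # selects a word in a block
    block_bytes = words_per_block * word_bytes
    index = int(math.log2(cache_size_bytes // block_bytes))  # selects a cache entry
    tag = addr_bits - index - block_offset - byte_offset
    return tag, index, block_offset, byte_offset

# 16-KB cache, 256 blocks of 16 words (512 bits) each:
print(cache_fields(16 * 1024, 16))  # -> (18, 8, 4, 2)
```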
A cache access has two parts: (i) accessing the tag array and comparing the requested memory address with the stored address tag to determine whether the data is in the cache; (ii) accessing the data array to bring out the requested data. For a set-associative cache, the results of the tag comparison are used to select the requested line from within the set driven out of the data array.
In practice, a cache is divided into two separate arrays: a small tag array,
and a larger SRAM data array. Fig. 2.11 shows the organization of a typical
SRAM-based cache given in CACTI [16]. This organization is used as the ba-
sic organization assumed for power modeling throughout this dissertation.
Figure 2.10: Basic organization of a direct-mapped cache [7] (address bits 31-14 form the 18-bit tag, bits 13-6 the 8-bit index, bits 5-2 the block offset and bits 1-0 the byte offset; each of the 256 entries holds a valid bit, an 18-bit tag and 512 data bits; a comparator produces the hit signal and a 16-to-1 multiplexor selects the 32-bit data word)
The access procedure to the assumed cache (given in Fig. 2.11) consists of
precharge and evaluation phases that can be divided into the following steps:
1. Address decoding: Address bits are input to the row and column decoders (shown as a single decoder block in Fig. 2.11). For each address combination, the row decoder drives exactly one tag wordline and one data wordline in the tag and data arrays, respectively, while the column decoder selects a set of bitline pairs (BL/BL̄) in the tag and data arrays. Thus, for each
address combination only a set of memory cells in the tag and data arrays is selected.

Figure 2.11: Basic organization of a typical SRAM-based cache (a decoder drives the tag and data wordlines; both arrays feed sense amplifiers through column multiplexers; the tag readouts go to comparators whose match lines control the MUX drivers; the data readouts pass through the data output drivers and, together with the valid output, through the total-output driver)
2. Bitline precharging: A simple precharge scheme is used that precharges
all bitline pairs in the tag and data arrays to Vdd during the precharge
phase. The precharge scheme is deactivated during the evaluation phase.
3. Selecting memory cells: The evaluation phase starts when the row de-
coder fires, driving a wordline high. Each memory cell in the selected
row pulls down one of its two bitlines; the value stored in the memory
cell determines which bitline goes low.
4. Preparing for sensing: The column decoder fires and connects sense
amplifiers to their selected bitline pairs through a multiplexer (MUX).
This step is needed only if the number of sense amplifiers is less than the number of bitline pairs, i.e. if more than one bitline pair shares a sense amplifier.
5. Sensing: Each sense amplifier (SA) monitors a pair of bitlines and de-
tects when one changes. By detecting which bitline goes low, the sense
amplifier determines the content of the selected memory cell. Voltage
differential sense amplifiers are assumed to be used in both the tag and
data arrays. In order to minimize the sensing time, bitlines of all sense
amplifiers are precharged to high during the precharge phase.
6. Tag comparing: The information read from the tag array is compared
to the address tag bits to determine if the requested block exists in the
cache. The number of comparators needed equals the number of ways of the cache's set-associativity, e.g. for a direct-mapped cache only one comparator is needed.
7. Checking the valid bit and selecting the data: The valid bit is checked
first to know if the entry contains a valid address. If V is set, and if the
tag comparison is successful, which means the requested block is found
in the cache (a cache hit), the MUX drivers are set to select the proper
data from the data array. If V is not set and/or the tag comparison is un-
successful, a cache miss occurs. The processor control unit, together with
a separate controller (neither is shown in Fig. 2.11), is responsible for de-
tecting a miss and processing the miss by fetching the requested data from
a lower-level cache or from the main memory. When the requested data
are available, a write into the cache occurs: (i) putting the requested data
in the data portion of the cache entry; (ii) writing the upper bits of the ad-
dress into the tag field; (iii) turning the valid bit on. On a cache miss, the
processor is simply stalled until the lower-level cache/main memory re-
sponds with the requested data. Then, the stalled cache access is restarted,
this time finding the data in the cache.
8. Driving out data: All output data from the cache (i.e. a valid bit and the
selected data) are driven to the appropriate bus through the total-output
drivers.
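A behavioral sketch of the read-hit/miss decision (steps 1, 6 and 7 above) for a direct-mapped cache; the function and array names are illustrative, and all electrical steps (precharging, sensing, driving) are abstracted away:

```python
def cache_read(tag_array, data_array, valid, address, index_bits=8, offset_bits=6):
    """Direct-mapped read: decode the index, compare the stored tag with the
    address tag, and check the valid bit to decide hit or miss."""
    index = (address >> offset_bits) & ((1 << index_bits) - 1)   # step 1: decode
    addr_tag = address >> (offset_bits + index_bits)
    if valid[index] and tag_array[index] == addr_tag:            # steps 6-7
        return True, data_array[index]                           # cache hit
    return False, None                                           # cache miss

# 256-entry arrays with a single block filled in:
valid = [False] * 256; tags = [0] * 256; data = [None] * 256
addr = 0x0001_2340
idx = (addr >> 6) & 0xFF
valid[idx], tags[idx], data[idx] = True, addr >> 14, "block"
print(cache_read(tags, data, valid, addr))          # -> (True, 'block')
print(cache_read(tags, data, valid, 0x4000_0000)[0])  # -> False
```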
Thus, there are two potential critical paths in a cache access: the tag-array access and the data-array access. The tag-array access consists of (i) reading the tag array; (ii) performing the tag comparison; (iii) driving the multiplexor select signal. The data-array access consists of (i) reading the data array; (ii) driving the data to the multiplexor. If the tag-array access takes longer than the data-array access, the tag side is the critical path; otherwise, the data side is the critical path.
In practice, which side becomes the critical path depends strongly on the cache organization parameters (e.g. cache size, associativity, line size, data word length), on process technology parameters, and on the types of circuits used to implement the components of the cache. Detailed descriptions of the cache components and their circuitry are given in Chapter 5 of this dissertation.
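The critical-path comparison above amounts to taking the maximum of two sums; the step delays below are made-up placeholders, not characterized values:

```python
def cache_critical_path(tag_steps, data_steps):
    """Access time is set by the slower of the tag and data sides."""
    tag_time, data_time = sum(tag_steps), sum(data_steps)
    side = "tag" if tag_time > data_time else "data"
    return side, max(tag_time, data_time)

# Assumed step delays in ps: (read tag, compare, drive select) vs (read data, drive to MUX)
print(cache_critical_path((400, 250, 100), (550, 150)))  # -> ('tag', 750)
```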
2.4.2 Memory Partitioning
Partitioning is one of the most successful techniques for memory energy optimization: a large memory array is divided into smaller arrays in such a way that each of them can be independently controlled and accessed. The aim of the partitioning approach is to find the best balance between energy savings and delay and area overheads. Memory can be partitioned at two levels: logical and physical [17].
Logical partitioning involves creating several smaller memory macros instead of the original single large array, and then synthesizing control logic to activate the different memory macros. In this approach, each memory macro is actually a separate, smaller memory array with decoders, precharge and read/write circuits of its own. Control logic is added on top to activate one array at a time based on the address inputs. Since this scheme requires extra control circuitry, some extra wiring, and multiple decoders, precharge and read/write circuits, designers always try to strike a balance between the energy savings from having small arrays and the overhead of supporting them.
Physical partitioning, on the other hand, involves dividing the original array into several sub-arrays that share decoders, precharge and read/write circuits, and then synthesizing internal control circuitry to provide mutually exclusive activation of the sub-arrays inside the original array. Moreover, the internal control circuitry is merged with the row/column decoders to generate the sub-array selection signals, so the introduction of extra circuitry is effectively limited. In physical partitioning, memory arrays can be partitioned horizontally using the divided word-line (DWL) technique proposed by Yoshimoto et al. [18], vertically using the hierarchical divided bit-line (DBL) technique presented by Karandikar and Parhi [19], or bidirectionally using a combination of both techniques [17]. Due
to their advantages in energy-efficiency and ease of implementation, physically
partitioned memory arrays are widely used in L1 and/or L2 caches of recent
microprocessors and DSPs.
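The balance described in this subsection can be illustrated with a deliberately simple model (the model form and its coefficients are assumptions for this sketch, not figures from the thesis): per-access energy has a bank term that shrinks with the number of partitions and a control/wiring overhead term that grows with it:

```python
def access_energy(total_bits, n_banks, e_bit=1.0, e_ctrl=2000.0):
    """Toy model: only one bank of total_bits/n_banks cells is activated per
    access, but each extra bank adds decoder/control/wiring overhead."""
    bank_energy = (total_bits / n_banks) * e_bit   # active sub-array
    overhead = n_banks * e_ctrl                    # extra decoders, wiring, control
    return bank_energy + overhead

# Sweep the number of banks for a 64-Kbit array (arbitrary energy units):
energies = {k: access_energy(64 * 1024, k) for k in (1, 2, 4, 8, 16, 32)}
best = min(energies, key=energies.get)
print(best)  # the sweet spot for these made-up coefficients -> 8
```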
Bibliography
[1] Dake Liu, Compendium: Design of Embedded DSP Processors, Department of Electrical Engineering, Linköping University, Linköping, Sweden, second edition, 2004.
[2] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[3] J. Eyre and J. Bier, “DSP Processors Hit the Mainstream,” IEEE Computer, vol.
31, no. 8, pp. 51–59, Aug. 1998.
[4] S. Agarwala, C. Fuoco, T. Anderson, D. Comisky, and C. Mobley, “A Multi-
level Memory System Architecture for High Performance DSP Applications,” in
Proceedings of International Conference on Computer Design (ICCD), September
2000, pp. 408–413.
[5] S. Agarwala, T. Anderson, A. Hill, M.D Ales, R. Damodaran, P. Wiley,
S. Mullinnix, J. Leach, A. Lell, M. Gill, A. Rajagopal, A. Chachad, M. Agarwala,
J. Apostol, M. Krishnan, Bui Duc, An Quang, N.S. Nagaraj, T. Wolf, and T.T.
Elappuparackal, “A 600-MHz VLIW DSP,” IEEE Journal of Solid-State Circuits,
vol. 37, no. 11, pp. 1532–1544, Nov. 2002.
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Ap-
proach, Morgan Kaufmann, fourth edition, 2006.
[7] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The
Hardware/Software Interface, Morgan Kaufmann, third edition, 2005.
[8] B. L. Jacob, P. M. Chen, S. R. Silverman, and T. Mudge, “An Analytical Model for Designing Memory Hierarchies,” IEEE Transactions on Computers, vol. 45, no. 10, pp. 1180–1194, Oct. 1996.
[9] N. P. Jouppi and S. J. E. Wilton, “Tradeoffs in Two-level On-chip Caching,” in Proceedings of the Annual International Symposium on Computer Architecture, Apr. 1994, pp. 34–45.
[10] J. K. Peir, W. W. Hsu, and A. J. Smith, “Functional Implementation Techniques for CPU Cache Memories,” IEEE Transactions on Computers, vol. 48, no. 2, pp. 100–110, Feb. 1999.
[11] Intel Pressroom Homepage, http://www.intel.com/pressroom/, 2007.
[12] AMD Homepage, http://www.amd.com, 2007.
[13] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner, “Power5 system architecture,” IBM Journal of Research and Development, vol. 49, no. 4, pp. 505–521, Sept. 2005.
[14] S. Rusu, H. Muljono, and B. Cherkauer, “Itanium 2 Processor 6M: Higher Fre-
quency and Larger L3 Cache,” IEEE Micro, vol. 24, no. 2, pp. 10–18, Apr. 2004.
[15] J. Eyre, “The digital signal processor derby,” IEEE Spectrum, vol. 38, no. 6, pp.
62–68, June 2001.
[16] S.J.E. Wilton and N.P. Jouppi, WRL Research Report 93/5: An Enhanced Access
and Cycle Time Model for On-chip Caches, Western Research Laboratory, 1994.
[17] P. Sithambaram et al., “Design and Implementation of a Memory Generator for Low-Energy ASBE SRAMs,” in PATMOS 2005, Sept. 2005, pp. 477–487.
[18] M. Yoshimoto et al., “A divided word-line structure in the static RAM and its application to a 64K full CMOS RAM,” IEEE JSSC, vol. 18, no. 5, pp. 479–485, Oct. 1983.
[19] A. Karandikar et al., “Low Power SRAM Design Using Hierarchical Divided Bit-line Approach,” in ICCD 1998, Oct. 1998, pp. 82–88.
3 Power Dissipation in CMOS
The goal of this chapter is to explain the most important mechanisms behind
power dissipation of CMOS circuits. This is essential for the readers who wish
to understand the probing used in the component characterization phase given
in Chapter 5 of this dissertation. Section 3.1 first gives some background in-
formation on mechanisms of power dissipation in CMOS circuits. Then, Sec-
tion 3.2 provides some insights into the trends of leakage power dissipation
in current process technologies, and emerging issues. Finally, Section 3.3 de-
scribes some useful power reduction/cut-off techniques to combat dynamic and
leakage power dissipation in digital circuits, caches and SRAM arrays.
3.1 Mechanisms of Power Dissipation
Based on the behavior of digital CMOS circuits and mechanisms for power
dissipation, total power dissipation of a digital circuit can be decomposed into
two main components: static (the power consumed when the circuit is in the
‘steady state’) and dynamic (the power consumed during switching, when the
circuit is in the ‘active state’).
Figure 3.1: Leakage mechanisms in an off-state NMOS transistor with VG = VS = 0 and VD = Vdd
While the dynamic type of power dissipation consists of switching power,
glitching power and short-circuit power, the static type includes many more
power dissipation mechanisms. Fig. 3.1 illustrates significant leakage mech-
anisms which exist in an off-state NMOS transistor [1]: The reverse-bias pn
junction leakage (I1), the subthreshold leakage (I2), the Gate-Induced-Drain
Leakage (I4), the channel punchthrough current (I5), the gate oxide leakage
(I7), and the gate current due to hot-carrier injection (I8). I3 and I6 are subthreshold leakage components caused by Drain-Induced Barrier Lowering (DIBL) and by the Short-Channel Effect (SCE) together with the narrow-width effect via VT modulation, respectively. Currents I2, I3, I4, I5 and I6 are off-state leakage mechanisms, while I1 and I7 occur in both the on-state and the off-state. Current I8 can occur in the off-state, but more typically occurs during transitions of the transistor bias [2].
Of the above-mentioned types of power dissipation, switching power is the one that has so far been considered by the high-level power estimation community to be the completely dominating source of power dissipation. The second most significant source is considered to be subthreshold leakage power. Moreover, as technology scales below 70 nm, gate oxide leakage has become one of the significant contributors to the total leakage power dissipation [3]. As technology scaling continues into the very deep submicron regime, pn junction leakage also receives much attention and is already considered another significant source of leakage power dissipation for future CMOS processes.
Based on their degree of importance, only four sources of power dissipation are described in more detail in this section: switching power, subthreshold leakage, gate leakage, and pn junction leakage power. More detailed descriptions of the other sources of power dissipation are given in [2] [4] [5].
3.1.1 Dynamic Power
Switching Power
Switching power constitutes the major part of the total power dissipation in today's and future digital CMOS circuits. Although it has been reduced by various techniques such as supply voltage scaling and clock gating, it will remain the dominating source of power dissipation in future technologies.

Switching power is basically the power consumed during charging and discharging of the capacitances associated with each circuit node, and can be summarized as:

Pswitching = α CL Vdd² fclk = α CL ∆V Vdd fclk    (3.1)
Here, CL is the load capacitance, fclk is the clock frequency, Vdd is the supply voltage, ∆V is the voltage swing of the node, and α is the node's '0→1' transition activity factor, which takes values between 0 and 1.
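Plugging representative, assumed values into Eq. 3.1 gives an idea of the magnitudes involved:

```python
def switching_power(alpha, c_load, v_dd, f_clk, v_swing=None):
    """Eq. 3.1: P = alpha * C_L * dV * Vdd * f_clk; full-swing nodes
    (dV = Vdd) reduce to the familiar alpha * C_L * Vdd^2 * f_clk."""
    dv = v_dd if v_swing is None else v_swing
    return alpha * c_load * dv * v_dd * f_clk

# Assumed example values: 0.1 activity, 50 fF node, 1.1 V supply, 2 GHz clock
p = switching_power(0.1, 50e-15, 1.1, 2e9)
print(f"{p * 1e6:.1f} uW")  # -> 12.1 uW
```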
3.1.2 Leakage Power
Subthreshold Leakage
Ideally, CMOS circuits dissipate no static (DC) power since in the steady state
there is no direct path from Vdd to ground. Of course, this assumption can
never be realized in practice since in reality the MOS transistor is not a perfect
switch, which means that there will always be leakage currents even when the
MOS transistors are OFF. The subthreshold leakage power is due to the sub-
threshold leakage current, i.e. the drain to source current running through the
channel occurring due to a potential difference between source and drain even
if the gate voltage is far below the threshold voltage.
In the weak inversion region, the subthreshold current can be calculated by Eq. 3.2, taken from [5] [6]. From this equation, it is clear that the subthreshold current depends strongly on several technological parameters, especially the threshold voltage and the temperature.

Isub = I0 (1 − e^(−Vds/Vth)) e^((−VT − Voff)/(n Vth))    (3.2)

where,

I0 = µ (W/L) Vth² √(q ǫsi NDEP / (2 φs));    Vth = kB T / q    (3.3)
Here, q is the electron charge, T is the temperature, n is the subthreshold swing coefficient, kB is the Boltzmann constant, NDEP is the channel doping concentration (similar to Nch defined in [7]), φs is the surface potential, ǫsi is the dielectric constant of silicon, µ is the carrier mobility at TNOM, Vth is the thermal voltage, Vds is the drain-source voltage, Voff is the offset voltage, W is the width, L is the length, and VT is the device threshold voltage.
At the fixed nominal temperature (TNOM = 27 °C), VT is defined by a rather complex expression (Eq. 3.4) accounting for effects such as the body effect, ∆VT,body_effect, charge sharing, ∆VT,charge_sharing, the DIBL, ∆VT,DIBL, the reverse short-channel effect, ∆VT,reverse_short_channel, the narrow-width effect, ∆VT,narrow_width, the small-size effect, ∆VT,small_size, and the pocket implant, ∆VT,pocket_implant. VTH0 is the threshold voltage of a long-channel device at zero bias, and δNP is defined as +1 for NMOS and −1 for PMOS. For more detailed equations, see [6].

VT = VTH0 + δNP (∆VT,body_effect − ∆VT,charge_sharing − ∆VT,DIBL
     + ∆VT,reverse_short_channel + ∆VT,narrow_width + ∆VT,small_size
     − ∆VT,pocket_implant)    (3.4)
Then, the dependence of the subthreshold leakage on the temperature, T, is modeled using temperature-dependent scaling equations:

VT(at T) = VT(at TNOM) + KT (T/TNOM − 1)    (3.5)

µ(at T) = µ(at TNOM) (T/TNOM)^UTE    (3.6)

KT = KT1 + KT1L/Leff + KT2 Vbseff    (3.7)
Here, Vbseff is the effective bulk-source voltage, KT 1 is the temperature coef-
ficient of the threshold voltage, KT 1L is the channel-length coefficient of the
threshold voltage’s temperature dependence, KT 2 is the bulk-bias coefficient
of the threshold voltage’s temperature dependence, UTE is the temperature co-
efficient for the zero-field universal mobility µ0, and Leff is the effective gate
length.
Eqs. 3.2 - 3.7 show how complicated it is to analytically calculate the subthreshold leakage current even for a single MOS transistor. Thus, given the large number of transistors typically found in digital circuits, accurately estimating subthreshold leakage power is obviously a time-consuming and challenging task requiring long computation times.
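The strong temperature dependence expressed by Eqs. 3.2, 3.3 and 3.5 can be sketched numerically. The parameter values below are placeholders rather than BSIM model-card values for any real process; the sketch only illustrates the exponential dependence on VT and T:

```python
import math

def v_thermal(t_kelvin):
    """Thermal voltage Vth = kB * T / q (Eq. 3.3)."""
    return 1.380649e-23 * t_kelvin / 1.602176634e-19

def i_sub(i0, vt0, v_ds, t_kelvin, n=1.5, v_off=-0.08, kt=-0.11, t_nom=300.15):
    """Eq. 3.2, with the linear VT(T) scaling of Eq. 3.5 (KT lumped into kt).
    i0 and the other parameters are assumed values, not a real model card."""
    vth = v_thermal(t_kelvin)
    vt_t = vt0 + kt * (t_kelvin / t_nom - 1)            # Eq. 3.5
    return i0 * (1 - math.exp(-v_ds / vth)) * math.exp((-vt_t - v_off) / (n * vth))

cold = i_sub(1e-6, 0.35, 1.1, 300.15)   # 27 C
hot = i_sub(1e-6, 0.35, 1.1, 380.15)    # 107 C
print(hot / cold > 5)  # leakage grows sharply with temperature -> True
```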
Gate Oxide Leakage
While subthreshold leakage is still the major source of static power dissipation in today's technologies, gate leakage is catching up, especially for technologies below 50 nm [3]. For a 2004 technology generation device, gate leakage power already contributes as much as 15% of the total power dissipation [8]. Therefore, unless the gate leakage problem is handled efficiently (including substantially better high-k materials and gate-leakage suppression circuit techniques), this scenario will be our reality in less than 10 years' time [3].
The gate leakage is due to the direct tunneling currents that penetrate the thin
gate insulator. Unlike the subthreshold leakage, gate leakage is present in both
off-state and on-state MOS transistors which makes gate leakage more difficult
to control than the subthreshold one. In an on-state transistor, the gate leakage
is the sum of two components: the gate-to-channel and gate-to-source/drain
extension (gate-to-SDE) overlap currents, while in an off-state transistor it is
equal to the edge-direct tunneling (EDT) current [9]. Therefore, gate leakage depends strongly on the voltage potential on the transistor gate, VG, the gate oxide thickness, Tox, the gate oxide insulator material, K, and the transistor width, W, rather than on the temperature. The gate leakage current can be approximated by Eq. 3.8, taken from [10]. It is clear from Eq. 3.8 that gate leakage is reduced if Tox increases; however, this is not a good option since an increase in Tox also degrades the transistor's effectiveness.

Igate = K W (Vdd/Tox)² e^(−α Tox/Vdd)    (3.8)
Here, parameters K and α can be derived experimentally.
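A numerical sketch of Eq. 3.8 (with arbitrary, assumed values for the fitted constants K and α) confirms the exponential sensitivity to Tox:

```python
import math

def i_gate(w, t_ox, v_dd, k=1.0, alpha=8.0):
    """Eq. 3.8: Igate = K * W * (Vdd/Tox)^2 * exp(-alpha * Tox / Vdd).
    K and alpha are experimentally fitted; the values here are arbitrary."""
    return k * w * (v_dd / t_ox) ** 2 * math.exp(-alpha * t_ox / v_dd)

thin = i_gate(w=1.0, t_ox=1.2, v_dd=1.1)   # Tox in nm, arbitrary current units
thick = i_gate(w=1.0, t_ox=1.6, v_dd=1.1)
print(thin > thick)  # thicker oxide -> exponentially less gate leakage: True
```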
Recently, gate leakage power has been intensively studied by many researchers, and some solutions to this problem have been proposed, including the introduction of high-k materials [11] and several gate-leakage-suppression circuit techniques [12]. Hence, the overall picture will fortunately not be as bad as predicted in [3].
pn Junction Reverse-Bias Leakage
The pn junction leakage is due to the currents running across the reverse-biased
drain- and source-to-well junctions. It has two major components: (i) minority
carrier diffusion/drift near the edge of the depletion region; (ii) electron-hole
pair generation in the depletion region of the reverse-biased junction [2]. The
junction leakage current is a function of junction area and doping concentration
of p and n regions. If both p and n regions are heavily doped, band-to-band tunneling (BTBT) dominates the pn junction leakage. In advanced MOS transistors, heavily doped shallow junctions and halo doping are often used to reduce the Short-Channel Effect (SCE), which is why the pn junction leakage is usually referred to as BTBT leakage in recent advanced CMOS process technologies.
The BTBT current can be estimated using Eq. 3.9, where m* is the effective mass of the electron; Eg is the energy band gap; Vapp is the applied reverse bias; E is the electric field at the junction (it should exceed 10^6 V/cm); q is the electron charge; h is 1/2π times Planck's constant [2]; Na and Nd are the doping concentrations in the p and n regions, respectively; and Vbi is the built-in voltage across the junction.

JBTBT = A (E Vapp / Eg^(1/2)) e^(−B Eg^(3/2) / E)    (3.9)

where,

A = (2m*)^(1/2) q³ / (4 π³ h²),    B = 4 (2m*)^(1/2) / (3 q h)    (3.10)

E = √(2 q Na Nd (Vapp + Vbi) / (ǫsi (Na + Nd)))    (3.11)
From Eqs. 3.9 - 3.11, it is obvious that the BTBT leakage current depends strongly on the doping concentrations and on the total voltage drop across the junction, which needs to exceed the energy band gap to enable tunneling.
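Eqs. 3.9 - 3.11 can be sketched numerically to confirm this doping dependence. The fitting constants below are arbitrary placeholders (A and B are folded into dimensionless values), so only the trend, not the magnitude, is meaningful:

```python
import math

def e_field(na, nd, v_app, v_bi=0.9, eps_si=1.04e-10):
    """Eq. 3.11: junction electric field (SI units; doping in m^-3)."""
    q = 1.602176634e-19
    return math.sqrt(2 * q * na * nd * (v_app + v_bi) / (eps_si * (na + nd)))

def j_btbt(e, v_app, e_g=1.12, a=1.0, b=30.0):
    """Eq. 3.9 with A and B replaced by arbitrary fitting constants."""
    return a * e * v_app / math.sqrt(e_g) * math.exp(-b * e_g ** 1.5 / e)

light = j_btbt(e_field(1e23, 1e23, 1.0), 1.0)   # ~1e17 cm^-3 doping
heavy = j_btbt(e_field(1e25, 1e25, 1.0), 1.0)   # ~1e19 cm^-3 doping
print(heavy > light)  # heavier doping -> larger field -> more BTBT: True
```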
3.2 Trend of Development and Emerging Issues
The International Technology Roadmap for Semiconductors (ITRS) [13] predicts the development trends of future process technologies against some key scaling goals: (i) for High Performance (HP) applications, the key target is to maintain the historical 17% per year transistor performance increase; (ii) for Low Power chips (e.g. mobile applications), the target is specifically a low level of leakage current. For example, for Low STandby Power (LSTP) applications the goal is very low leakage at lower performance, targeting consumer applications, while for Low Operating Power (LOP) applications the target is low dynamic power at relatively higher performance. Recent ITRS main predictions can be briefly summarized as follows:
1. Technology-node concept: The traditional, simple ITRS technology-node concept was rapidly becoming too much of an oversimplification of the industry state of the art. This has been reflected in recent years by the growing confusion among researchers when one single typical process is used to represent a "technology node" in press releases, conference presentations, publications, etc. The technology-node concept has clearly outlived its usefulness and will therefore be gradually abandoned.
2. Alternative technology: It is expected to become increasingly difficult to effectively scale planar bulk CMOS devices beyond the 65-nm technology generation (with a physical gate length of 25 nm). The major problems are: (i) adequately controlling SCE is projected to become especially problematic; (ii) the channel doping will need to be increased to exceedingly high values, causing a reduction in mobility and very high BTBT leakage between drain and body; (iii) the total number of dopants in the channel becomes relatively small, resulting in unacceptably large statistical variation of the threshold voltage. A potential solution is to utilize ultra-thin-body, fully depleted SOI MOSFETs. Single-gate SOI MOSFETs are projected for 2008 for high-performance logic, while more complex and more scalable multiple-gate SOI MOSFETs are projected to be implemented in 2011.
3. Gate oxide scaling: For extended planar bulk CMOS devices, ITRS 2005 projected that high-k gate dielectric and metal gate technology would be required by 2008 to control the leakage. However, the deployment of high-k gate dielectrics and metal gate electrodes has been delayed by two years, until 2010. The Equivalent Oxide Thickness (EOT), defined as Td/(κ/3.9) to relate a gate dielectric of thickness Td and relative dielectric constant κ to SiO2, continues to scale, but its rate of scaling is quite slow from 2005 through 2007. There is, however, a sharp EOT decrease in 2008, when high-k gate dielectrics are assumed to be implemented (Fig. 3.2).
4. Supply voltage: Continues to scale, though not very aggressively, from 1.1 V in 2007 (for the 65-nm technology generation with a physical gate length of 25 nm) to 0.9 V in 2013 (for the 32-nm technology generation with a physical gate length of 13 nm).
5. Major leakage components: The three leakage mechanisms that will continue to dominate in future processes are subthreshold leakage, gate oxide leakage, and reverse-biased drain- and source-substrate junction BTBT. With technology scaling, each of these major leakage components increases drastically, contributing significantly to a dramatic increase in total leakage. Fig. 3.3 shows a near-term prediction of gate and subthreshold leakage for future process technologies until 2012.
6. Emerging research: MOS scaling will likely become ineffective and/or very costly; novel non-CMOS devices, circuits and/or architectures are therefore potential solutions.
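The EOT definition from item 3 above is a one-line computation; the film thickness and κ value in the example are assumed for illustration:

```python
def eot(t_d_nm, kappa):
    """Equivalent Oxide Thickness: EOT = Td / (kappa / 3.9), where 3.9 is
    the relative dielectric constant of SiO2."""
    return t_d_nm / (kappa / 3.9)

# A 4-nm film of a high-k dielectric with kappa = 20 behaves, capacitance-wise,
# like a much thinner SiO2 layer:
print(f"{eot(4.0, 20.0):.2f} nm")  # -> 0.78 nm
```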
Figure 3.2: EOT and gate leakage density scaling for extended planar bulk CMOS devices (ITRS 2006)
Figure 3.3: Scaling in subthreshold leakage for extended planar bulk CMOS devices
(ITRS 2006)
Figure 3.4: Gate length scaling for extended planar bulk CMOS devices (ITRS 2006)
3.3 Leakage Power Reduction Techniques
Power-saving techniques are widely used across levels of design abstraction, i.e.
software, architecture, circuits, devices, and technology [14]. Some approaches
utilizing cooperation between different levels of design abstraction have also
been reported [15]. Power-saving techniques are usually designed for operating
in two different modes of circuit operation: active and sleep. Depending on cir-
cuit topology and the associated major sources of power dissipation, different
techniques are employed to achieve an efficient reduction in total power dis-
sipation. Architecture-level power-saving techniques include dynamic-voltage
scaling (DVS), clock gating, frequency-voltage control and multi-processor de-
sign. Circuit-level techniques include selection of logic style, transistor sizing,
transistor reordering, logic-gate restructuring, gated clocks, optimizing inter-
connect, layout consideration, power cut-off techniques, and low power SRAM
design with virtual ground [16]. Despite the wide range of power-saving tech-
niques, this section focuses mainly on a survey of those circuit-level power cut-
off techniques used for leakage power reduction in digital circuits and SRAM.
Figure 3.5: Leakage current paths in the SCCMOS technique (from [12])
3.3.1 Power Cut-off Techniques
Some power cut-off techniques mainly target subthreshold leakage; these
include Super Cut-Off CMOS (SCCMOS) [17], Multi-threshold CMOS
(MTCMOS) [18], and an enhanced version of SCCMOS, Zigzag Super Cut-Off
CMOS (ZSCCMOS) [19]. These techniques suppress subthreshold leakage currents when a
logic circuit is not active, i.e. it is in the sleep mode. Since a circuit dissipates
leakage power not only in sleep mode, but also in active mode (referred to as
active leakage), MTCMOS has recently been used in conjunction with a clock-
gating technique to reduce both dynamic and leakage power when the circuit is
active [20]. For this type of application, the wake-up time of a power cut-off
technique is an important issue. While MTCMOS and SCCMOS have wake-up
times of several clock cycles, ZSCCMOS can offer a wake-up time of less than
one clock cycle by employing a sophisticated scheme with a virtual ground rail.
However, the efficiency of the ZSCCMOS technique is degraded due to gate
leakage currents [21].
Figure 3.6: Leakage current paths in the ZSCCMOS technique (from [12])
Fig. 3.5 shows the leakage current paths in the SCCMOS technique where
Ig , Isth are the gate and subthreshold leakage currents in inverters A, B, and
C, respectively, and Itot is the total leakage current. In ZSCCMOS (Fig. 3.6),
the virtual power rails are connected to the logic transistor nets that are OFF in
sleep mode, while the conducting transistor nets use the external power rails.
This scheme cuts off the subthreshold current paths (dashed arrows); however,
the gate leakage paths (solid arrows) from the external supply to ground rails
remain, leaving the voltage across the gate insulators close to Vdd.
Thus, this technique is inefficient for gate leakage reduction.
The Gate leakage Suppressing CMOS (GSCMOS) technique is shown in
Fig. 3.7, in which an additional virtual supply rail with a separate power switch
was added [12]. The added virtual supply rail 2, connected to the logic tran-
sistor nets that are conducting while in sleep mode, effectively eliminates all
gate leakage paths. The wake-up time due to the added virtual supply rail can
be limited through some design steps: (i) In sleep mode, the GSCMOS circuit
is forced to the state for which gate and subthreshold leakage components to-
gether exhibit minimal current; (ii) In active mode, the power switches are sized
for equal voltage drops (bounces) on each of the virtual power rails in the worst
case scenario [12].
Figure 3.7: Leakage current paths in the GSCMOS technique (from [12])
Following these design steps, power switches are distributed and sized so that
they equalize the charge (discharge) times of the virtual power rails for a
transition from sleep to active mode. When the logic circuit operates in active
mode, there are gate leakage currents in the on-state power switches; however,
this leakage is very small compared to the dynamic currents and is therefore
negligible. Due to oxide-stress reliability issues, GSCMOS, as well as
SCCMOS and ZSCCMOS, requires oxide-stress relaxed level
shifters [22] to generate control voltages (Vgn,Vgp) for the power switches. To
force the logic inputs to the required state GSCMOS must, in the same way
as for ZSCCMOS, employ flip-flops using a phase forcing circuit [22]. When
GSCMOS goes into sleep mode, the voltage of virtual supply rail 2 drops to
about Vdd/2. As a result, the internal voltages are undefined. Thus, like SCCMOS
and ZSCCMOS, GSCMOS must store data in external SRAM cells that are not
connected to virtual power rails [23] before it enters sleep mode.
3.3.2 Leakage-Reduction Techniques for SRAM-based Caches
Leakage-reduction techniques for SRAM-based cache and memory have been
studied intensively by many authors. The main techniques include, e.g., drowsy
caches [24], gated-Vdd [25], gated-ground [26], dual-VT [27], MTCMOS [28],
dynamic-VT SRAM [29], and reverse/forward body-biased SRAM [30].
The drowsy-cache technique utilizes the dynamic-voltage-scaling (DVS)
principle to reduce leakage power. In active mode, a nominal supply voltage
is provided to memory cells, while in sleep or drowsy mode, a stand-by inter-
mediate voltage level is applied to memory cells to reduce the leakage power.
The stand-by voltage must be higher than the minimum state-preserving voltage
considering process variations such as transistor VT and channel length [24]. In
the drowsy mode, accesses to memory cells are not allowed: since the voltage
level of BL/BL is higher than that of the cross-coupled inverters inside the cell,
an access may destroy the state information stored in the cell. Besides, the
sense amplifier may not operate properly due to insufficient driving capability
of the accessed cells.
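The drowsy-mode access restriction can be sketched as a simple state machine. This is a hypothetical illustration only, assuming the one-cycle wakeup penalty reported for drowsy cache lines later in this section:

```python
# Hypothetical sketch of drowsy-line control: a line in drowsy mode must
# first be restored to the nominal supply before it can be read or written,
# costing one wakeup cycle (illustrative, per reported drowsy-cache results).
ACTIVE, DROWSY = "active", "drowsy"

class DrowsyLine:
    def __init__(self):
        self.mode = ACTIVE

    def drowse(self):
        # lower the supply to a state-preserving standby voltage
        self.mode = DROWSY

    def access(self) -> int:
        """Return the extra wakeup cycles incurred by this access."""
        if self.mode == DROWSY:
            self.mode = ACTIVE   # restore nominal Vdd before the access
            return 1             # one-cycle wakeup penalty
        return 0

line = DrowsyLine()
line.drowse()
print(line.access())  # 1 (wakeup penalty on first access)
print(line.access())  # 0 (line is active again)
```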
The basic concept of dual-VT is to use low-VT , faster and leakier transistors
for circuits in the critical path, and high-VT , slower transistors for the rest of
the circuits to suppress unnecessary subthreshold leakage currents. Normally,
in SRAM-based cache designs, the low-VT transistors have been used in the
peripheral circuits of the caches, and in the pass-transistors connecting mem-
ory cells to BLs/BLs while high-VT transistors are used for memory cells [27].
This technique requires no additional control circuitry and can significantly re-
duce the subthreshold leakage currents compared to low-VT devices. Besides,
no data are discarded and no additional cache misses are incurred. However,
this technique suffers from longer bitline delay due to the high-VT devices that
have slower switching speed and lower current drive.
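The benefit of raising VT follows from the exponential dependence of subthreshold leakage on VT. A rough sketch using the textbook relation I_sth ∝ exp(−VT/(n·vT)); the subthreshold-swing factor n = 1.5 and the 100-mV VT step are assumed values for illustration:

```python
import math

def subthreshold_ratio(delta_vt_mv: float, n: float = 1.5,
                       temp_k: float = 300.0) -> float:
    """Factor by which subthreshold leakage drops when VT is raised by
    delta_vt_mv, using I_sth ~ exp(-VT / (n * v_T)) with v_T = kT/q."""
    v_t_mv = 8.617e-5 * temp_k * 1000.0  # thermal voltage kT/q in mV
    return math.exp(delta_vt_mv / (n * v_t_mv))

# Raising VT by 100 mV (an illustrative low-VT -> high-VT step)
print(subthreshold_ratio(100.0))  # roughly 13x less subthreshold leakage
```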
The gated-Vdd and gated-ground techniques reduce leakage power by plac-
ing high-VT transistors between the circuit and power supply rails, i.e. Vdd
and ground, respectively, to turn off the supply power of the memory cell when
the cell is in the low-power mode. These high-VT gating transistors effectively
reduce the subthreshold leakage power of the memory cell circuit because of
the stacking effect and the exponential dependence of the subthreshold leakage
on VT. The main disadvantage of these techniques is that all state information
within the memory cell is lost, which may inflict a significant performance
penalty when the memory cell is accessed and requires a complex and conservative
cache management policy to handle it. Furthermore, these gating transistors
are in the critical path, thus resulting in increased access time of the caches.
Leakage currents can be reduced by dynamically raising the transistor VT
using the principle of modulating the back-gate bias voltage [28] [29] [31]. Dur-
ing normal operation, the memory cell is connected to Vdd and ground and back-
gate voltages are set to the appropriate power rails. When sleep is activated, the
p-channel wells are biased using an alternative power supply voltage, Vdd+, at
a higher voltage level than the source terminals, raising the effective VT. All
transistors inside memory cells experience higher VT and therefore the leakage
currents are reduced significantly. The major advantage of MTCMOS is that
memory cell values are preserved during sleep mode whereas the disadvantages
include (i) an additional power-supply voltage that must be distributed through-
out the array; (ii) larger electric fields placed across the transistor gates during
sleep which may affect the reliability of memory cells; (iii) a latency penalty to
awaken a line being in the sleep mode before data can be accessed [28].
Among the above-mentioned techniques, drowsy caches have received con-
siderable attention; it was shown in [32] that total cache leakage energy was
reduced by an average of 76% at a wakeup penalty, for a drowsy cache line, of
no more than one cycle. Moreover, drowsy caches can be implemented easily
using simple control circuits that assign different voltage levels, called tranquility
levels, at different priority levels, based on information from the replacement
policy used [33]. These advantages make the drowsy cache one of the most widely used
techniques for leakage reduction in caches and SRAM arrays.
Bibliography
[1] A. Keshavarzi, K. Roy, and C. F. Hawkins, “Intrinsic leakage in low power deep
submicron CMOS IC’s,” in Proceedings of International Test Conference (ITC),
1997, pp. 146–155.
[2] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage Current Mech-
anisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Cir-
cuits,” Proceedings of the IEEE, vol. 91, no. 2, pp. 305–327, Feb. 2003.
[3] D. Helms, E. Schmidt, and W. Nebel, “Leakage in CMOS Circuits - An Introduc-
tion,” in Proceedings of International Workshop on Power and Timing Modeling,
Optimization and Simulation (PATMOS’04), LNCS 3254, Sept. 2004, pp. 17–35.
[4] A. Agarwal, S. Mukhopadhyay, A. Raychowdhury, K. Roy, and C. H. Kim, “Leak-
age Power Analysis and Reduction for Nanoscale Circuits,” IEEE Micro, vol. 26,
no. 2, pp. 68–80, Apr. 2006.
[5] W. Liu, MOSFET Models for SPICE Simulation including BSIM3v3 and BSIM4,
John Wiley & Sons, Inc., 2001.
[6] Univ. California Berkeley Device Group, BSIM4.2.1 MOSFET Model: User’s
Manual, Dept. of EECS, Univ. of California, Berkeley, CA 94720, USA, 2002.
[7] University of California Berkeley Device Group, BSIM3v3.2.2 Manual, Device
Research Group of the Dept. of EE and CS, University of California, Berkeley,
1999.
[8] R. M. Rao, J. L. Burns, A. Devgan, and R. B. Brown, “Efficient Techniques for
Gate Leakage Estimation,” in Proceedings of International Symposium on Low
Power Electronics and Design (ISLPED), Sept. 2003, pp. 17–35.
[9] M. Draždžiulis and P. Larsson-Edefors, “A Gate Leakage Reduction Strategy for
Future CMOS Circuits,” in European Solid-State Circuits Conference (ESSCIRC),
2003, pp. 317–320.
[10] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin,
M. Kandemir, and V. Narayanan, “Leakage current: Moore’s Law Meets Static
Power,” IEEE Computer, vol. 36, no. 12, pp. 68–75, Dec. 2003.
[11] R. Chau, S. Datta, M. Doczy, J. Kavalieros, and M. Metz, “Gate dielectric scaling
for high-performance CMOS: from SiO2 to High-K,” in Extended Abstracts of
International Workshop on Gate Insulator (IWGI 2003), Nov. 2003, pp. 124–126.
[12] M. Draždžiulis, P. Larsson-Edefors, D. Eckerbert, and H. Eriksson, “A power
cut-off technique for gate-leakage suppression,” in European Solid-State Circuits
Conference, Sept. 2004, pp. 171–174.
[13] International Technology Roadmap for Semiconductors, http://public.itrs.net,
ITRS, 2006.
[14] T. Sakurai, “Perspectives on Power-Aware Electronics,” in Digest of Technical
Papers, IEEE International Solid-State Circuits Conference, 2003, vol. 1, pp. 26–
29.
[15] T. Sakurai, “Minimizing Power Across Multiple Technology and Design Levels,”
in IEEE/ACM International Conference on Computer Aided Design, 2002, pp. 24–
27.
[16] V. Venkatachalam and M. Franz, “Power Reduction Techniques for Microproces-
sor Systems,” ACM Computing Surveys, vol. 37, no. 3, pp. 195–237, Sept. 2005.
[17] H. Kawaguchi et al., “A Super Cut-Off CMOS (SCCMOS) Scheme for 0.5-V
Supply Voltage With Picoampere Stand-By Current,” IEEE Journal of Solid-State
Circuits, vol. 35, no. 10, pp. 1498–1501, Oct. 2000.
[18] S. Mutoh et al., “1-V Power Supply High-Speed Digital Circuit Technology with
Multithreshold-Voltage CMOS,” IEEE Journal of Solid-State Circuits, vol. 30, no.
8, pp. 847–854, Aug. 1995.
[19] K.-S. Min et al., “Zigzag Super Cut-Off CMOS (ZSCCMOS) Block Activa-
tion with Self-Adaptive Voltage Level Controller: An Alternative to Clock-Gating
Scheme in Leakage Dominant Era,” in International Solid-State Circuits Confer-
ence (ISSCC), 2003, pp. 400–402.
[20] J. W. Tschanz et al., “Dynamic Sleep Transistor and Body Bias for Active Leakage
Power Control of Microprocessors,” IEEE Journal of Solid-State Circuits, vol. 38,
no. 11, pp. 1838–1845, Nov. 2003.
[21] M. Draždžiulis and P. Larsson-Edefors, “Evaluation of Power Cut-off Techniques
in the presence of Gate Leakage,” in Proceedings of the International Symposium
on Circuits and Systems (ISCAS), May 2004, pp. 475–478.
[22] K.-S. Min et al., “Zigzag Super Cut-Off CMOS (ZSCCMOS) Block Activa-
tion with Self-Adaptive Voltage Level Controller: An Alternative to Clock-Gating
Scheme in Leakage Dominant Era,” in Digest of Technical Papers of International
Solid-State Circuits Conference, 2003, pp. 400–402.
[23] H. Kawaguchi et al., “A Super Cut-Off CMOS (SCCMOS) Scheme for 0.5-V
Supply Voltage With Picoampere Stand-By Current,” IEEE Journal of Solid-State
Circuits, vol. 35, no. 10, pp. 1498–1501, Oct. 2000.
[24] K. Flautner, Nam Sung Kim, S. Martin, D. Blaauw, and T. Mudge, “Drowsy
Caches: Simple Techniques for Reducing Leakage Power,” in Proceedings of the
29th Annual International Symposium on Computer Architecture (ISCA), May 2002,
pp. 148–157.
[25] M. Powell, S. H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, “Gated-Vdd: A
Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories,” in
Proceedings of the International Symposium on Low Power Electronics and Design
(ISLPED), July 2000, pp. 90–95.
[26] A. Agarwal, H. Li, and K. Roy, “A Single Vth Low-leakage Gated-ground Cache
for Deep Submicron,” IEEE Journal of Solid-State Circuits, vol. 38, no. 2, pp.
319–328, Feb. 2003.
[27] F. Hamzaoglu, Y. Ye, A. Keshavarzi, K. Zhang, S. Narendra, S. Borkar, M. Stan,
and V. De, “Analysis of dual-VT SRAM cells with full-swing single-ended bit line
sensing for on-chip cache,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 10, no. 2, pp. 91–95, Apr. 2002.
[28] T. Douseki, N. Shibata, and J. Yamada, “A 0.5-1V MTCMOS/SIMOX SRAM
Macro with Multi-Vth Memory Cells,” in Proceedings of IEEE International SOI
Conference, Oct. 2000, pp. 24–25.
[29] C. H. Kim and K. Roy, “Dynamic Vth SRAM: A Leakage Tolerant Cache Memory
for Low-voltage Microprocessors,” in Proceedings of International Symposium on
Low Power Electronics and Design (ISLPED), Aug. 2002, pp. 251–254.
[30] C. H. Kim, J. J. Kim, S. Mukhopadhyay, and K. Roy, “A Forward Body-biased
Low-leakage SRAM Cache: Device and Architecture Considerations,” in Proceed-
ings of International Symposium on Low Power Electronics and Design (ISLPED),
Aug. 2003, pp. 6–9.
[31] K. Nii, H. Makino, Y. Tujihashi, C. Morishima, Y. Hayakawa, H. Nunogami,
T. Arakawa, and H. Hamano, “A Low power SRAM using Auto-backgate-
controlled MT-CMOS,” in Proceedings of International Symposium on Low Power
Electronics and Design (ISLPED), Sept. 1998, pp. 293–298.
[32] K. Flautner et al., “Drowsy Caches: Simple Techniques for Reducing Leakage
Power,” in ISCA 2002, May 2002, pp. 148–57.
[33] N. Mohyuddin et al., “Controlling Leakage Power with the Replacement Policy in
Slumberous Caches,” in CF 2005, May 2005, pp. 161–70.
4 Cache Power Modeling – Tool Perspective
This chapter provides a review of existing power estimation and performance
analysis tools for microprocessors (Section 4.1) and information about some
existing power estimation tools for on-chip caches (Section 4.2). Finally,
Section 4.3 presents some background information on power modeling in general,
its classification, and its areas of application. Detailed descriptions of some
power models used in several existing power estimation and performance
analysis tools are also given in that section.
4.1 Architecture-level Performance Simulator and
Power Dissipation Estimator: a Survey of
Existing Tools
During the past decade, a fair amount of research effort has been directed to-
wards developing tools for superscalar microprocessors and for multiprocessor
systems. Examples of performance analysis tools1 include Simics [1], SimOS [2],
SimpleScalar [3], HydraScalar [4], and RSIM [5].
Simics and SimOS are complete system simulation platforms that can func-
tionally model the execution of complex software systems on the instruction-set
architectural abstraction level. They are designed to boot and run commercial
unmodified operating systems, with realistic workloads, and can simulate sev-
eral types of superscalar microprocessors at the instruction-set level, including
the full supervisor state [1].
SimpleScalar is a powerful microarchitectural simulation infrastructure that
has the capability of modeling a whole range of superscalar microarchitec-
tural designs. It can model a variety of platforms ranging from simple un-
pipelined processors to the detailed dynamically scheduled microarchitectures
with multiple-level memory hierarchies. It has fairly small code sizes and offers
a documented and well-structured design [3].
HydraScalar is an expanded version of SimpleScalar (version 2.0) that
accurately models a wide-issue, out-of-order execution, multipath superscalar
processor [4].
RSIM is an execution-driven simulator that simulates a variety of shared-memory
ILP superscalar multiprocessor (and uniprocessor) architecture configurations.
It can model state-of-the-art instruction-level parallelism (ILP) multiprocessors,
a high-performance memory system, and a multiprocessor coherence protocol and
interconnect, including contention at all resources [5].
1 They are also referred to as performance simulators.
Together with these performance analysis tools, several power dissipation es-
timation tools for superscalar processors have also been designed, including
Wattch [6], SimplePower [7], TEM2P2EST [8], AccuPower [9], HotLeakage [10],
and PowerTimer [11].
Wattch is an architecture-level power dissipation estimator based on a suite
of parameterizable power models for different hardware structures and on
per-cycle resource usage counts generated by a cycle-level simulator.
Basically, Wattch is built upon the SimpleScalar (version
3.0) out-of-order simulator that has been extended conceptually from a 5-stage
pipeline to an 8-stage pipeline [6].
TEM2P2EST is another power dissipation estimator that is built upon the
SimpleScalar (version 2.0) out-of-order simulator. The main difference between
the two power dissipation estimators lies in their power models for estimating
active power dissipation. Neither simulator estimates static power dissipation;
both simply assume that static power dissipation is about 10% of active power
dissipation [8].
SimplePower is an execution-driven, cycle-accurate RTL power estimation
tool that is used in evaluating algorithmic, architectural and compiler optimiza-
tions. SimplePower is based on the architecture of a simple 5-stage pipelined
datapath and simulates only the integer subset of the instruction set of Sim-
pleScalar [7]. It simulates executables compiled from benchmark programs with
the SimpleScalar compiler toolset, providing cycle-by-cycle energy estimates
and switch capacitance statistics for the processor datapath, memory and
on-chip buses [12].
AccuPower is a power estimation tool that combines a true hardware-level,
cycle-level microarchitectural simulator with energy/power dissipation
coefficients taken from SPICE data of actual CMOS layouts of critical datapath
components to obtain accurate power dissipation estimates for superscalar
microprocessors with several variants of the superscalar datapath. AccuPower
is a heavily modified version of the SimpleScalar simulator (especially the
Register Update Unit in the datapath), designed to mimic an actual hardware
implementation of modern superscalar microprocessors [9].
HotLeakage is a micro-architectural simulation tool based on Wattch and
the Cache-decay simulator. In this work, Parikh et al. [10] developed an archi-
tectural model for subthreshold and gate leakage that explicitly captures temper-
ature, voltage, and parameter variations. This was an attempt to further develop
the methodology of Butts and Sohi [13] to address the effect of temperature on
leakage power dissipation.
Besides the public tools, there also exist accurate power estimation tools
that are available within the organizations of individual microprocessor ven-
dors for specific architectures. Examples of these tools include IBM’s MET
and its associated power estimation components PowerTimer for a PowerPC
implementation [11], and Compaq’s ASIM for simulating and estimating tran-
sition activity within Alpha processor implementations [14]. Nevertheless, in
the following paragraphs, a brief review of those tools is given to provide a
complete overview of existing power-performance estimation tools.
Based on publicly available documents released by IBM [11] [15], Pow-
erTimer is a toolset developed for use in early-stage, microarchitecture-level
power performance analysis of microprocessors. It includes a parameterized
microarchitecture evaluation toolset (MET), a cycle-accurate performance sim-
ulator (Turandot) within the MET and research microarchitecture power mod-
els (RMAP). For general research studies, the Turandot/MET is used to read
instructions from a program's executable code, or from its traces, and then
simulate the timing flow within the targeted processor. All timing issues, such
as pipeline latencies and stall/flush occurrences, are modeled as accurately as
possible, enabling Turandot/MET to generate an accurate performance figure (in
processor cycles) for the given input program. When designing a new PowerPC
processor, a baseline cycle-accurate performance simulator is selected
accordingly. These cycle-accurate performance simulators are the property of
IBM and are not publicly available.
The microarchitecture-level energy models used in PowerTimer are derived
from either (i) energy characterization data obtained using low-level circuit-
and simulation-based research tools (e.g. circuit-simulation-based,
RTL-simulation-based, or actual hardware-measurement-based tools) for
available components of previous designs; or (ii) analytical models built to
characterize the power on the basis of the implementation structure of each
microarchitectural entity or event (at the gate or circuit level, with or
without interconnect effects). These energy models are implemented in C as
energy functions and are called RMAP.
The PowerTimer toolset is currently in use to provide early-stage power
performance analysis and microarchitecture definition of high-end, general pur-
pose IBM PowerPC processors [15]. This toolset is not accessible outside the
IBM corporation.
ASIM is a performance modeling framework used in the Compaq (formerly
Digital) processor design team to simulate and predict performance of Alpha
processors [16]. ASIM consists of collections of modules (implemented in
C++), each of which represents a physical component of a targeted processor
or captures a hardware algorithm's operation. Each ASIM module is
designed as a software component providing a well-defined interface for users
(or developers) to reuse modules in different contexts or replace them with other
modules implementing a different algorithm for the same function. Each mod-
ule interface uses method calls to communicate between a module and its em-
bedded sub-modules, and uses ports to provide communication and timing be-
tween modules. Using this framework, several performance models for
uniprocessors, vector processors, chip multiprocessors, etc. have been
developed. In
order to get the performance figure of a microprocessor, users need to create a
performance model for it and then run ASIM on that performance model with
a program or benchmark. Since ASIM is used mainly for simulating and esti-
mating performances of Alpha processors, it does not provide any information
of the power consumed by those processors, and, moreover, it is not accessible
outside the Compaq corporation.
All of the above mentioned performance-power estimation tools are de-
signed to estimate power dissipation and performance of superscalar single and
multi-processor architectures, but none of them is dedicated to single or
parallel DSP architectures. Clearly, there is a lack of efficient Power-Performance
Simulator/Estimators for DSP parallel architectures. This is the area in which
a power-performance simulator for DSPs (e.g. the DSP-PP simulator which is
presented in Chapter A) is intended to contribute.
4.2 High-Level Power Estimation Tools for Caches
During the past decade, some research effort has also been directed towards
developing analytical models for estimating dynamic and static power
dissipation for SRAM-based caches and SRAM arrays at the architecture level;
however, only a few power models have been made publicly available.
CACTI is one of the most widely used power estimation tools in the public
domain [17]. It offers analytical timing and energy models for un-partitioned
and partitioned on-chip caches. In its previous versions 1.0, 2.0 and 3.2, CACTI
used only ideal first-order scaling for technology trends. Further, it did not
include any leakage power models.
The recently released CACTI version (4.0 [18]) is updated with respect to
basic circuit structures, to device parameters for improved technology
scaling, and to leakage models, in that a model based on HotLeakage [19] and
eCACTI [20] is added. However, the added model still fails to accurately
account for short-channel effects, gate leakage, and terminal voltage
dependencies in transistor stacks; the model error in estimating leakage power
dissipation was claimed to be below 21.5% [20].
Zeng et al. developed the Predictor of Access and Cycle Time for Cache
Stack (PRACTICS) tool [21], which uses analytical models (similar to those of CACTI)
to determine an optimal design for partitioned caches by exhaustive compari-
son of alternative memory configuration parameters. Although PRACTICS pro-
vides more accurate estimates of interconnect effects in comparison to CACTI 3.2,
it still does not include power models for leakage estimation, and therefore has
limited accuracy in estimating total power dissipation.
4.3 Power Dissipation Estimation Models
4.3.1 High-level Power Dissipation Estimation
Methodology
In general, architecture-level power dissipation estimation methods can be clas-
sified into two groups: Analytical (statistical) and Simulation-based.
Analytical power estimation models have been used in several projects, e.g.
[6] [13] [10] and [22]. The advantage of the analytical model is the simplicity
of the formulas used to calculate the dynamic and leakage power dissipation
estimates. These simple formulas allow architects to rapidly obtain the power
estimates and consider power characteristics of alternative designs. However,
analytical models usually offer low accuracy compared to the estimates from
circuit-level power estimation tools like SPICE and its clones. Moreover, due
to the simplicity of the formulas, analytical models may not cover the complete
deep-submicron behavior of MOS transistors and wiring, causing an unaccept-
able decrease in accuracy [23].
In contrast to the analytical approach, the simulation-based power estima-
tion methods offer very accurate power estimates at the price of long estima-
tion run-time. The simulation-based power estimation methods can be
implemented with table-based or equation-based power models. The difference
between the two is that table-based models store "discrete" tabulated power
dissipation values, while equation-based models are mathematical equations
obtained by "generalizing" those discrete power estimates using curve-fitting
techniques, e.g. linear and non-linear regression. A more detailed description
of analytical, table-based and equation-based power models is given in the
next section.
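The distinction can be illustrated with a toy model: a table of pre-characterized power values next to a linear least-squares fit over the same points. All cache sizes and power values below are invented for illustration:

```python
# Illustrative only: "discrete" table-based lookup vs an equation fitted
# to the same points (simple linear regression over invented data).
sizes_kb = [8, 16, 32, 64]            # cache size (KB), hypothetical
power_mw = [4.1, 7.9, 16.2, 31.8]     # pre-characterized power (mW), invented

def table_lookup(size_kb: int) -> float:
    """Table-based model: only the tabulated configurations are valid."""
    return power_mw[sizes_kb.index(size_kb)]

# Equation-based model: least-squares line p = a * size + b over the table
n = len(sizes_kb)
mean_x = sum(sizes_kb) / n
mean_y = sum(power_mw) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_kb, power_mw)) \
    / sum((x - mean_x) ** 2 for x in sizes_kb)
b = mean_y - a * mean_x

def equation_model(size_kb: float) -> float:
    """Equation-based model: also usable between the tabulated points."""
    return a * size_kb + b

print(table_lookup(32))              # exact tabulated value
print(round(equation_model(48), 1))  # interpolated estimate for 48 KB
```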
4.3.2 Analytical Models
Some research has been directed towards developing analytical models for
estimating dynamic and static power dissipation at the architectural design level.
The first two models, the Cai-Lim and the Wattch models, are fundamentally
similar, relying on activity-based power models to estimate dynamic power
dissipation [24].
Cai-Lim Power Estimation Models
The Cai-Lim power model is an activity-sensitive power model built on the Sim-
pleScalar 2.0 out-of-order simulator [25]. It partitions the basic SimpleScalar
architecture into 17 hardware structures that are further subdivided into a total
of 32 physical blocks. Each physical block is then further divided into power
density and area for both active and inactive contributions from dynamic, static,
programmable logic array (PLA), clock, and memory sections of the block.
Area estimates are based on publicly available designs with additional area al-
located for clocking, interconnects, and power supply. Active circuit power
density is estimated from SPICE simulations of typical designs based on Tai-
wan Semiconductor Manufacturing Corporation (TSMC) 0.25-µm process files.
Then, power density numbers are used as constants in conjunction with the
activity counters to model power dissipation. The basic power estimation
formulas are as follows:
Overall Power Dissipation:

P_c = P_active + P_static ≈ P_dynamic + P_leakage

P_c = Σ_i { EAF × Σ_m (EA × APD)_m }_i + Σ_i { (1 − EAF) × Σ_m (EA × IPD)_m }_i    (4.1)

Dynamic Power Dissipation:

P_dynamic = Σ_i { Power(active)_i } = Σ_i { EAF × Σ_m (EA × APD)_m }_i    (4.2)

Static Power Dissipation:

P_leakage = Σ_i { Power(inactive)_i } = Σ_i { (1 − EAF) × Σ_m (EA × IPD)_m }_i    (4.3)
Here, EAF is the effective activity factor, EA is the effective area, APD is
the active power density, IPD is the inactive power density, i is the number of
cycles, and m is the circuit type.
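Eqs. (4.1)-(4.3) can be transcribed directly. A sketch only; the EAF, EA, APD and IPD values below are invented for illustration:

```python
# Direct transcription of Eqs. (4.1)-(4.3): sum over cycles i and circuit
# types m, weighting active and inactive power densities by the effective
# activity factor EAF.
def cai_lim_power(cycles):
    """cycles: list of (EAF, [(EA, APD, IPD), ...]) per cycle i,
    one (EA, APD, IPD) tuple per circuit type m (units arbitrary)."""
    p_dyn = p_leak = 0.0
    for eaf, blocks in cycles:
        p_dyn += eaf * sum(ea * apd for ea, apd, _ in blocks)           # Eq. (4.2)
        p_leak += (1.0 - eaf) * sum(ea * ipd for ea, _, ipd in blocks)  # Eq. (4.3)
    return p_dyn, p_leak, p_dyn + p_leak                                # Eq. (4.1)

# Two cycles, two circuit types; all numbers are illustrative.
cycles = [
    (0.8, [(1.0, 10.0, 1.0), (0.5, 20.0, 2.0)]),  # busy cycle
    (0.2, [(1.0, 10.0, 1.0), (0.5, 20.0, 2.0)]),  # mostly idle cycle
]
p_dyn, p_leak, p_c = cai_lim_power(cycles)
print(p_dyn, p_leak, p_c)
```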
The Cai-Lim model tracks how a hardware structure is used by breaking
it down into different types of accesses and then counting each time that type
of access occurs during a cycle. This structural breakdown and its associated
information provide an opportunity for detailed modeling and ability to track re-
duction in dynamic activity [24]. All values for power densities and areas have
been pre-computed and included as part of the source code of the power estima-
tor. Cai-Lim does not claim any specific accuracy, but in general an accuracy of
75% of layout-level power tools is expected [24].
Wattch Power Estimation Models
Wattch is a collection of power models. Wattch divides the main microprocessor
units into four categories: array structures (including data and instruction
caches, cache tag arrays, all register files, the register alias table, branch predictors,
and large portions of the instruction window and load/store queue); content-
addressable memories (including instruction window/reorder buffer wakeup logic,
load/store order checks, and TLBs); combinational logic and wires (including
functional units, instruction window logic, and result busses); and clocking
(clock buffers, clock wires, and capacitive loads). Wattch uses power models
for these basic components, where one of them is an "all components always
on" model and the remaining three are activity sensitive with varying
degrees of conditional clocking enabled [6]. The basic power estimation for-
mulas are as follows:
Overall Power Dissipation:

\[ P_c = P_{active} + P_{static} \approx P_{dynamic} + P_{leakage} \quad (4.4) \]

Dynamic Power Dissipation:

\[ P_{dynamic} = a \, C \, V_{dd}^2 \, f \quad (4.5) \]

Static Power Dissipation: assumed to be 10% of \( P_{dynamic} \)
Activity factors a for certain critical sub-circuits are obtained by running bench-
mark programs on an architectural simulator, SimpleScalar. Otherwise,
a = 1 for circuits that precharge and discharge on every cycle, and a = 0.5 for
sub-circuits whose activity cannot be simulated. Supply voltage Vdd and
clock frequency f are taken from the assumed 0.35-µm process technology.
The load capacitance C is estimated based on the circuit and the transistor siz-
ing using the formulas shown in Table 4.1 [6].
Table 4.1: The equations for capacitance of critical nodes

Array Structure, Register-file Wordline:
  C = Cdiff(WordLineDriver) + Cgate(CellAccess) × NumBitlines + Cmetal × WordLineLength
Array Structure, Register-file Bitline:
  C = Cdiff(Precharge) + Cdiff(CellAccess) × NumWordlines + Cmetal × BitLineLength
CAM Structure, Tagline:
  C = Cgate(CompareEnable) × NumberTags + Cdiff(CompareDriver) + Cmetal × TagLineLength
CAM Structure, Matchline:
  C = 2 × Cdiff(CompareEnable) × TagSize + Cdiff(MatchPrecharge) + Cdiff(MatchOR) + Cmetal × MatchLineLength
Complex Logic Blocks, Result Bus:
  C = 0.5 × Cmetal × NumALU × ALUHeight + Cmetal × RegisterFileHeight

Wattch claims an accuracy within 10% of layout-level power tools and provides
validation results that indicate an average accuracy of ±13% when comparing
relative power against known relative powers for implemented architectures
(Pentium Pro and Alpha 21264) [6], [24]. Wattch uses technology
scaling factors included for processes ranging from 0.1-µm to 0.8-µm in its
power models.
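The Wattch estimate thus reduces to the switching-power equation plus a fixed leakage fraction. A minimal sketch under these assumptions follows; the activity factor, load capacitance, supply voltage, and clock frequency are illustrative placeholders, not values from Wattch's process files.

```python
# Sketch of Wattch-style power estimation (Eqs 4.4-4.5):
# P_dynamic = a * C * Vdd^2 * f, with static power assumed 10% of dynamic.
def wattch_power(activity, c_load, vdd, freq):
    p_dyn = activity * c_load * vdd**2 * freq   # Eq. 4.5
    p_static = 0.10 * p_dyn                     # Wattch's fixed assumption
    return p_dyn + p_static                     # Eq. 4.4

# a = 1 for circuits that precharge and discharge every cycle,
# a = 0.5 when the activity cannot be simulated; otherwise from SimpleScalar.
p = wattch_power(activity=0.5, c_load=2e-12, vdd=2.5, freq=600e6)
```

The fixed 10% static fraction is exactly the shortcoming discussed below: leakage is not modeled, only assumed.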
Both the Wattch and Cai-Lim power models are based on the SimpleScalar
toolset that is commonly used to model microarchitectures in educational and
some research environments. They are fairly flexible and acceptably accurate
for process technologies of 0.25-µm and 0.35-µm. However, they still have
some shortcomings: the lack of directly accessible details on scaling factors
limits the Cai-Lim model's ability to directly compute relative contributions
to power from different blocks. The model is also difficult to extend
without examining the original process files to determine how to incorporate
new hardware structures. Wattch provides for greater access to the underlying
details of the models than the Cai-Lim model does. Counters for different types of
accesses are employed, but many details are still left out. This lack of granularity
in access counting limits Wattch's ability to identify power savings from reduced
activity. In addition, Wattch does not estimate the inactive power dissipation
due to subthreshold leakage current, but simply assumes that its contribution is
10% of the active power. These models, therefore, have limited accuracy
and lack scalability to future process technologies.
Butts-Sohi Static Power Models
Butts and Sohi [13] proposed a generic, high-level model for micro-architecture
components. The model is based on a key design parameter, Kdesign, capturing
device type, device geometry and stacking factors that can be obtained based
on simulations. Its subthreshold leakage model addresses several issues
affecting static power in a way that makes it easy to reason about leakage
effects at the micro-architectural level. However, it turns out
not to be well suited for some types of SRAM circuits with power-saving and
leakage-reduction techniques like MT-CMOS, Gated-Vdd, and Drowsy Cache.
Also, it was never released as publicly available software.
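The Butts-Sohi estimate can be sketched as a single product over a component's device count; the function below follows their published form P_static = Vdd · N · k_design · I_leak, but the numeric values are invented for illustration, and k_design would in practice be pre-characterized from circuit simulation.

```python
# Sketch of the Butts-Sohi static power model:
#   P_static = Vdd * N * k_design * I_leak
# N        : number of transistors in the component
# k_design : empirical design factor capturing device type, geometry and
#            stacking, pre-characterized per circuit style (value made up)
# I_leak   : normalized leakage current of a single off device (made up)
def butts_sohi_static(vdd, n_transistors, k_design, i_leak):
    return vdd * n_transistors * k_design * i_leak

# a hypothetical 1M-transistor SRAM-style block
p_static = butts_sohi_static(vdd=1.2, n_transistors=1_000_000,
                             k_design=4.0, i_leak=20e-9)  # watts
```

The appeal of the model is that only k_design changes between circuit styles; its weakness, noted above, is that power-saving circuit techniques break the single-parameter abstraction.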
A Temperature-Aware Static Power Model (HotLeakage)
Parikh et al. [10] developed an architectural model for subthreshold and gate
leakage that explicitly captures temperature, voltage, and parameter variations.
This model was implemented in the micro-architectural HotLeakage simulation
tool based on Wattch and the Cache-decay simulator. This was an attempt to
extend the methodology of Butts and Sohi to address the effect of temperature
on leakage power dissipation. However, the accuracy of the leakage power
estimation for any complex circuit structures like memory arrays, caches, etc.,
is unknown.
An enhanced CACTI (eCACTI)
Another effort to further develop the methodology of Butts and Sohi is the work
by Mamidipaka et al. given in [22], [26] and [27]. In this work, the authors developed
analytical models parameterized in terms of high-level design parameters to
estimate leakage power in SRAM arrays. An error margin of "less than 23.9%"
compared to HSPICE power values is achieved by this method. These analytical
models are then implemented in an architecture-level power tool for SRAM
arrays, called eCACTI [20].
Research Microarchitecture Power Models (RMAP) of PowerTimer
Microarchitecture-level energy models used in PowerTimer can be derived from
either (i) energy characterization data obtained by using a low-level circuit-
and simulation-based research tool (i.e. CPAM [28]) for components of previous
designs; or (ii) analytical models built to characterize power on the basis of
the implementation structure of each microarchitectural entity or event.
RMAP consists of energy models implemented in C, which are
derived by using several methodological paths: (i) model formulation is based
on the unit-level and pipeline stage-level latch counts (called latch-based energy
models) that are estimated either from logic-level bit specifications of individual
functions or from area and latch-density of prior designs; (ii) model formulation
is based on detailed macro-level power simulation data that is available from
prior processor projects, and a utility script used to convert those data into high-
level, unit-specific energy functions; (iii) when detailed circuit schematics are
available, model formulation is based on low-level energy data generated by
CPAM for those circuits. Energy models for each microarchitecture block are
then formulated by collecting and abstracting those obtained energy data.
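As an illustration of path (i), a latch-based energy model can be sketched as follows. The latch count, per-latch energy, and clock-gating factor are invented for illustration and are not PowerTimer's actual characterization data.

```python
# Sketch of a latch-count-based unit energy model (RMAP path (i)); the
# unit's energy scales with its latch count and per-cycle clocking activity.
E_LATCH = 15e-15  # energy per latch per clocked cycle [J] (hypothetical)

def unit_energy(latch_count, clock_gate_factor, cycles):
    # clock_gate_factor: fraction of cycles in which the unit is clocked
    return latch_count * E_LATCH * clock_gate_factor * cycles

e_fetch = unit_energy(latch_count=12_000, clock_gate_factor=0.7,
                      cycles=1_000_000)
```

Latch counts would come from logic-level bit specifications or from the area and latch density of prior designs, as the text describes.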
4.3.3 Table-based and Equation-based Models
Schmidt et al. [29] developed an automatic black box memory-modeling ap-
proach based on nonlinear regression, which intends to combine good model
properties (i.e. accuracy, speed, etc.) with good modeling properties (i.e. au-
tomatism, adaptability to the design flow, low overhead, and IP protection). Never-
theless, this approach offers advantages at the price of a complex and compu-
tationally expensive model characterization phase. For typical memory arrays
whose regular internal structures are known and can easily be analyzed, a white
box modeling approach (e.g. our approach [30]) can be a good alternative to
the black box one, offering a simpler and faster model characterization phase.
Our approach is described in detail in Chapter 5.
In [31] Eckerbert et al. presented a methodology to accurately estimate to-
tal power dissipation (including static power) at the RT-level using simulation-
based power estimation models. The methodology takes into account the changes
in the component environment, which occur between characterization and esti-
mation. By separating the different power dissipation mechanisms this method-
ology achieves high degrees of accuracy in estimating power dissipation. Al-
though it is a complex and accurate RT-level simulation approach that mainly
focuses on estimating total power dissipation of complex components, such as
arithmetic-logic circuits, it still serves as a useful reference for our work. Furthermore, this
methodology can be used together with our proposed approach, where neces-
sary (e.g. ALUs, MACs, etc.), providing an architecture-level solution to the
problem of estimating total power dissipation of all processor components.
Bibliography
[1] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg,
F. Larsson, A. Moestedt, and B. Werner, “Simics: A Full System Simulation
Platform,” IEEE Computer, pp. 50–58, Feb. 2002.
[2] M. Rosenblum, E. Bugnion, A. Herrod, and S. Devine, “Using the SimOS Machine
Simulator to Study Complex Computer Systems,” ACM Transactions on Modeling
and Computer Simulation, vol. 7, no. 1, pp. 78–103, Jan. 1997.
[3] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An Infrastructure for Computer
System Modeling,” IEEE Computer, pp. 59–67, Feb. 2002.
[4] K. Skadron and Pritpal S. Ahuja, “HydraScalar: A Multipath-Capable Simulator,”
Newsletter of the IEEE Technical Committee on Computer Architecture, pp. 65–70,
Jan. 2001.
[5] C. Hughes, V. Pai, P. Ranganathan, and S. Adve, “RSIM: Simulating Shared-
Memory Multiprocessors with ILP Processors,” IEEE Computer,
pp. 40–49, Feb. 2002.
[6] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for Architectural-
Level Power Analysis and Optimizations,” in Proceedings of the Annual Interna-
tional Symposium on Computer Architecture, June 2000, pp. 83–94.
[7] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. Kim, and W. Ye, “Energy-Driven
Hardware-Software Optimizations Using SimplePower,” in Proceedings of the
Annual International Symposium on Computer Architecture, June 2000, pp. 95–106.
[8] A. Dhodapkar, C. Lim, G. Cai, and R. Daasch, “TEM2P2EST: A Thermal Enabled
Multi-Model Power/Performance ESTimator,” in Proceedings of the Workshop on
Power-Aware Computer Systems, Nov. 2000, pp. 112–125.
[9] D. Ponomarev, G. Kucuk, and K. Ghose, “AccuPower: An Accurate Power Esti-
mation Tool for Superscalar Microprocessors,” in Proceedings of the 5th Design
Automation and Test in Europe Conference, Mar. 2002, pp. 124–129.
[10] D. Parikh et al., “Comparison of State-Preserving vs. Non-State-Preserving Leak-
age Control in Caches,” in Proceedings of the Workshop on Duplicating, Decon-
structing and Debunking (held in conjunction with ISCA), June 2003, pp. 14–25.
[11] D. Brooks, J. Wellman, P. Bose, and M. Martonosi, “Power-Performance Modeling
and Tradeoff Analysis for a High-End Microprocessor,” in Proceedings of the
Workshop on Power-Aware Computer Systems, Nov. 2000, pp. 126–136.
[12] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “The Design and Use of
SimplePower: A Cycle-Accurate Energy Estimation Tool,” in Proceedings of the
Design Automation Conference, June 2000, pp. 340–345.
[13] J. A. Butts and G. S. Sohi, “A Static Power Model for Architects,” in Proceedings
of the International Symposium on Micro-architectures, Dec. 2000, pp. 191–201.
[14] The ASIM Manual, Compaq Computer Corporation, 2000.
[15] D. Brooks, P. Bose, V. Srinivasan, M. K. Gschwind, P. G. Emma, and M. G.
Rosenfield, “New Methodology for Early-Stage, Microarchitecture-Level Power-
Performance Analysis of Microprocessors,” IBM Journal of Research & Develop-
ment, vol. 47, no. 5, pp. 653–670, Sept. 2003.
[16] J. Emer, P. Ahuja, E. Borch, A. Klauser, C. Luk, S. Manne, S. Mukherjee, H. Patil,
S. Wallace, N. Binkert, R. Espasa, and T. Juan, “ASIM: A Performance Model
Framework,” IEEE Computer, pp. 68–76, Feb. 2002.
[17] S.J.E. Wilton et al., WRL 93/5: An Enhanced Access and Cycle Time Model for
On-chip Caches, WRL, 1994.
[18] D. Tarjan et al., HPL 2006-86: CACTI4.0, HP, 2006.
[19] Y. Zhang et al., CS 2003-05: HotLeakage: A Temperature-Aware Model of Sub-
threshold and Gate Leakage for Architects, Dept. of CS, Univ. of Virginia, USA,
2003.
[20] M. Mamidipaka et al., CECS 04-28: eCACTI: An Enhanced Power Estimation
Model for On-chip Caches, CECS, Univ. of California, Irvine, USA, 2004.
[21] A. Y. Zeng et al., “Cache Array Architecture Optimization at Deep Submicron
Technologies,” in Proceedings of the International Conference on Computer Design
(ICCD), Oct. 2004, pp. 320–325.
[22] M. Mamidipaka et al., “IDAP: A Tool for High-Level Power Estimation of Custom
Array Structures,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 23, no. 9, pp. 1361–1369, September 2004.
[23] M. Q. Do and L. Bengtsson, “Analytical Models for Power Consumption Estima-
tion in the DSP-PP Simulator: Problems and Solutions,” Technical Report No. 03-22,
Department of Computer Engineering, Chalmers University of Technology,
Göteborg, Sweden, 2003.
[24] S. Ghiasi and D. Grunwald, “A Comparison of Two Architectural Power Models,”
in Proceedings of the Workshop on Power-Aware Computer Systems, Nov. 2000,
pp. 137–152.
[25] G. Cai and C. H. Lim, “Architectural Level Power/ Performance Optimization and
Dynamic Power Estimation,” in Proceedings of Cool Chips Tutorial, Nov. 1999,
pp. 90–113.
[26] M. Mamidipaka, K. Khouri, N. Dutt, and M. Abadir, “A methodology for accu-
rate modeling of energy dissipation in array structures,” in Proceedings of 16th
International Conference on VLSI Design, Jan. 2003, pp. 320–325.
[27] M. Mamidipaka et al., “Leakage Power Estimation in SRAMs,” Technical Report
No. 03-32, Center for Embedded Computer Systems, University of California,
Irvine, USA, 2003.
[28] J. S. Neely, H. H. Chen, S. G. Walker, J. Venuto, and T. J. Bucelot, “CPAM:
A common power analysis methodology for high-performance VLSI design,” in
Proceedings of 9th Topical Meeting on Electrical Performance of Electronic Pack-
aging, Oct. 2000, pp. 303–306.
[29] E. Schmidt et al., “Memory Power Models for Multilevel Power Estimation and
Optimization,” IEEE Transactions on VLSI Systems, vol. 10, pp. 106–109, Apr.
2002.
[30] M. Q. Do, P. Larsson-Edefors, and L. Bengtsson, “Table-based Total Power Con-
sumption Estimation of Memory Arrays for Architects,” in Proceedings of Inter-
national Workshop on Power and Timing Modeling, Optimization and Simulation
(PATMOS’04), LNCS 3254, Sept. 2004, pp. 869–878.
[31] D. Eckerbert and P. Larsson-Edefors, “A Deep Submicron Power Estimation
Methodology Adaptable to Variations Between Power Characterization and Es-
timation,” in Proceedings of the 2003 Asia and South Pacific Design Automation
Conference, Jan. 2003, pp. 716–719.
Part III
Power Modeling for
SRAM-based Structures
5 Modular Approach to Power Modeling for On-Chip Caches
This chapter describes the work done on power modeling methodology for on-
chip caches. First, Section 5.1 shows in detail the drawbacks of an analytical
approach to power modeling, and the reason why a table-based, simulation-based
power modeling approach has been selected. After that, the proposed modular
hybrid power estimation modeling methodology for on-chip caches and SRAM
data arrays is described in detail in Section 5.2. Section 5.3 is dedicated to de-
scribing a probing methodology to correctly capture the total leakage currents of
sub-90nm logic circuits when circuit simulators, such as Hspice, are employed.
Section 5.4 presents power dissipation estimation models for on-chip caches
including power models for tag SRAM-based arrays and data SRAM arrays. Sec-
tion 5.5 is dedicated to validation of the obtained power models against circuit-
level simulations for a complete on-chip cache, and for physically partitioned and
unpartitioned SRAM arrays. Finally, in Section 5.6, the modeling methodol-
ogy to capture the dependence of leakage power on temperature variation, on
supply-voltage scaling, and on the selection of process corners is presented and
discussed in detail.
5.1 Analytical Approach to Power Modeling and
Its Induced Problems
As mentioned earlier in Section 1.3, the analytical approach is a straightforward
way to model the MOS transistor’s leakage mechanisms. The complexity of the
equations defines the accuracy of the approach in estimating leakage power. The
BSIM4 models describe leakage mechanisms using very detailed and com-
plex equations [1], for example, the BSIM4 models define the subthreshold
leakage current for a single MOS transistor using Eqs 3.2 - 3.7 given in Sec-
tion 3.1. Although BSIM4 models offer high accuracy in estimating leakage
power, accounting for variations in temperature, threshold voltage, technology-
related parameters, etc., they are obviously not suitable for higher-level power
estimation: their complex relations and equations require the user to have deep
knowledge of device models and access to detailed process parameters.
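To make the contrast concrete, the commonly used simplified form of the subthreshold current — a far cruder expression than the full BSIM4 Eqs 3.2-3.7 — can be sketched as follows; all parameter values here are generic placeholders, not calibrated to any process.

```python
import math

# Simplified BSIM-style subthreshold leakage current (generic textbook form):
#   I_sub = I0 * exp((Vgs - Vth) / (n * vT)) * (1 - exp(-Vds / vT))
# vT = k*T/q is the thermal voltage; Vth also falls with temperature, which
# is the main reason subthreshold leakage grows roughly exponentially with T.
K_OVER_Q = 8.617e-5  # Boltzmann constant over electron charge [V/K]

def i_subthreshold(vgs, vds, vth0, temp_k, i0=1e-7, n=1.5, vth_tc=-1e-3):
    """All parameters (i0, n, vth_tc) are illustrative, not process data."""
    v_t = K_OVER_Q * temp_k
    vth = vth0 + vth_tc * (temp_k - 300.0)  # crude linear Vth(T) approximation
    return i0 * math.exp((vgs - vth) / (n * v_t)) * (1.0 - math.exp(-vds / v_t))

# off-state NMOS (Vgs = 0, Vds = Vdd): leakage at 30 C versus 110 C
i_cold = i_subthreshold(0.0, 1.2, 0.35, 303.15)
i_hot = i_subthreshold(0.0, 1.2, 0.35, 383.15)
```

Even this toy model shows leakage rising by more than an order of magnitude over the temperature range of Fig. 5.1, which is why calibration to the actual process is essential.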
Several studies have been directed toward developing analytical architecture-level
leakage power models with support for supply-voltage scaling and temperature
variation, based on simplified versions of the BSIM3 and BSIM4 models for
subthreshold leakage current, i.e. [2] and [3], respectively. These represent
attempts to simplify a BSIM model into a less complex model, intended
for use in high-level power estimation tools, by introducing curve-fitting co-
efficients and circuit-dependent empirical constants fixed for each particular
process technology. The recently released CACTI version (4.0 [4]) has been
updated with a leakage model based on Hotleakage [2] and eCACTI [5] to offer
a rudimentary ability to estimate leakage power with supply-voltage scaling and
temperature variation over a set of typical technology nodes.
[Plot: subthreshold leakage current I_sub (A), log scale from 10^-9 to 10^-5 A, versus temperature from 30 to 110 °C; curves for the BSIM3 model and CACTI4]

Figure 5.1: Subthreshold leakage versus temperature for an NMOS tran-
sistor (commercial 130-nm process)
The concept of a technology node is however gradually being abandoned
(ITRS’05 [6]). Already today the notion of having one single typical process
to represent a "technology node" yields large estimation errors for static-power
dominated memories, since process technologies within a classical technology
node can be so different. With further technology scaling, the diversity in pro-
cess technology offerings will probably increase significantly, thus exacerbating
the problem.
Fig. 5.1 shows subthreshold leakage power (log scale) for a minimum-sized
130-nm NMOS transistor for a range of different temperatures. The power val-
ues obtained by using a BSIM3 model (dotted line) are approximately 250×
smaller than the values obtained by using the 130-nm leakage model imple-
mented in CACTI 4.0. This serves to illustrate the drawbacks of simplifying a
set of analytical leakage models: inaccuracy and inflexibility. Clearly, if leak-
age power models at architectural level are to guide design trade-offs, they can
not be based on generic process parameters, but they must be calibrated to the
actual target process(es).
5.2 The Proposed Modular Hybrid Power Estimation Modeling Approach
In general, as mentioned in Section 4.3.1, architecture-level power dissipation
estimation methods can be classified into two groups: Analytical (statistical)
and simulation-based. While the analytical estimation method uses mathemati-
cal formulas, the simulation-based power estimation methods are implemented
by either table-based or equation-based power models.
The proposed power estimation modeling methodology for SRAM-based
caches is a hybrid one, i.e. rather than using only one technique to estimate
power dissipation, the methodology seeks to find the best match between a par-
ticular estimation technique and a specific cache component. Fig. 2.11 shows
the organization of a typical SRAM-based cache that is divided into two ar-
rays: tag and data arrays. The tag array consists of the SRAM-based array, the
column multiplexers, the tag sense amplifiers, the tag writing circuits, the tag
wordline drivers, the comparators, the MUX-drivers, etc. The data array con-
sists of the SRAM data array, the data wordline/bitline drivers, the data sense
amplifiers, the data writing circuits, the data multiplexers, the output drivers,
etc. The row/column decoders are shared between two arrays. For each type of
cache components, based on its structure (since this is a white-box approach)
an analysis is performed to define the major mechanisms of power dissipation.
Then, based on the result of this analysis, the appropriate power estimation tech-
niques are selected. For example, a probabilistic approach has been used to esti-
mate both dynamic and static power of address decoders, an analytical approach
has been used to estimate dynamic power of bitlines and 6T-SRAM cells, sense
amplifiers, write circuits, and wordline drivers, while a circuit-simulation-based
modeling backend has been used to estimate all leakage power mechanisms.
Figure 5.2: Power modeling methodology: a) Component Characterization Phase, and
b) Power Estimation Phase
On closer inspection, the power estimation modeling approach for SRAM-based
caches consists of two underlying phases: Component Characterization and
Power Estimation (see Figs 5.2a and 5.2b) [7].
1. Component Characterization: takes as inputs the netlist of a typical
cache component, its states (i.e. Read, Write, Leak) and memory-orga-
nization parameters, generates leakage power values by performing a few
simple circuit-level DC simulations using the appropriate probes, and tab-
ulates those values into the pre-characterized leakage tables. The inde-
pendent inputs to the pre-characterized leakage tables are Type of compo-
nent (i.e. type of cache component), Component State (S), Temperature
(T ), Frequency (F ), Threshold Voltage (VT ), Supply Voltage (VDD) and
Process Corner (PV ). The power values in those tables are the per-cycle
leakage power dissipation of that component. In addition, for each cache
component, the nodal capacitances are also extracted using a circuit-level
simulator that establishes the operating point and DC capacitances.
2. Power Estimation: takes as inputs the pre-characterized leakage tables,
states of the component, input traces (i.e. a sequence of accesses like
{Read, Write, Write, Read, Write, Read, Leak, etc.}), and produces
power dissipation estimates in a cycle-by-cycle manner. For each cache
component, its power model for total power estimation consists of ana-
lytical equations for dynamic power and pre-characterized leakage power
values. Dynamic analytical power models are derived based on the well-
known activity-based switching power equation (Eq. 3.1) with nodal ca-
pacitances extracted during the Component Characterization phase. The
total leakage power accounts for all types of leakage currents that are
present in the transistor models used by the circuit-level simulator, dur-
ing both idle and active cycles. Total power dissipation of the component
is the sum of dynamic and leakage power dissipation values.
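Under the description above, the Power Estimation phase amounts to a table lookup plus the switching-power equation evaluated per cycle. The sketch below is illustrative only: the table keys, component name, and all numeric values are invented stand-ins for real characterization data.

```python
# Sketch of the proposed hybrid estimation: a pre-characterized leakage
# table combined with the activity-based switching-power equation (Eq. 3.1).
# Keys (component, state, temperature, Vdd) and all values are invented.
leak_table = {
    ("6T_cell", "Read",  70, 1.2): 3.1e-9,
    ("6T_cell", "Write", 70, 1.2): 3.3e-9,
    ("6T_cell", "Leak",  70, 1.2): 2.8e-9,
}

def cycle_power(component, state, temp, vdd, c_node, alpha, freq):
    p_leak = leak_table[(component, state, temp, vdd)]   # table lookup
    # dynamic switching power only in cycles where the component is accessed
    p_dyn = alpha * c_node * vdd**2 * freq if state != "Leak" else 0.0
    return p_dyn + p_leak

# cycle-by-cycle estimation over an input trace of accesses
trace = ["Read", "Write", "Write", "Read", "Leak"]
total = sum(cycle_power("6T_cell", s, 70, 1.2,
                        c_node=5e-15, alpha=0.5, freq=500e6) for s in trace)
```

In the real methodology the nodal capacitance c_node comes from the circuit-level extraction performed during Component Characterization, and the table is additionally indexed by frequency, threshold voltage, and process corner.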
For any cache component, the Component Characterization phase is typi-
cally performed only once by a cell-library designer. Computer architects hav-
ing access to the netlist of new components can also perform the characteriza-
tion of their components and create new tables for later use. When the char-
acterization is done, the pre-characterized leakage power values and the values
of the extracted nodal capacitances are tabulated for later use in the power es-
timation phase, and no further simulations are needed until the structure of the
component is modified. Therefore, as compared to those high-level analytical
power models implemented in existing power estimation tools, the proposed
power models offer much better accuracy and flexibility in estimating both to-
tal and leakage power dissipation for on-chip caches, requiring much less time
for the component characterization phase. Furthermore, the proposed model-
ing methodology is modular, thus, it can be applied to model power dissipation
for other types of components of regular structures, e.g. content-addressable-
memory (CAM).
5.3 Probing Methodology for Leakage
In submicron CMOS processes, other leakage mechanisms than subthreshold
leakage become significant and, therefore, a systematic probing methodology
is essential to obtain accurate power estimates. The reason why leakage-current
probing of very deep submicron circuits is complex is that currents no longer
only flow through the transistor channel. Rather, probes need to be applied so
that input and output circuit interfaces, through which significant currents flow,
can be captured. Since the proposed power estimation methodology is used
to calculate total power from the power of many regularly assembled memory
cells, the overall accuracy is very dependent on cell interface currents. In this
section, a methodology for probing circuits for static current measurements in
CMOS circuits during simulation is presented. The methodology is capable
of capturing all leakage mechanisms existing in BSIM4 models, in this case
implemented in the Hspice simulator. The full description of the methodology
together with some illustrative examples and a survey of related works are given
in [8].
[Figure: three MOS transistor diagrams (a), (b), (c) showing the Source (S), Gate (G), Drain (D) and Bulk (B) terminals with current probes i1 (Drain), i2 (Gate), i3 (Source) and i4 (Bulk)]

Figure 5.3: Current measurement for MOS transistors used in Hspice simulator
For MOS transistors, Hspice provides the ability to capture the Drain (D)
current, the Gate (G) current, the Source (S) current, and the Bulk (B) current
using current probes i1, i2, i3, and i4, respectively. Fig. 5.3(a) shows these cur-
rents and their Hspice-defined conventional directions.
The direction of gate, drain and source currents for MOS transistors is de-
fined by the value of VGS , VGD , and VDS . Fig. 5.3(b) shows the gate and
subthreshold leakage currents (broken lines) for an NMOS transistor in off-
state (i.e. VG = 0), and the gate leakage and Drain-Source currents (solid lines)
for an NMOS transistor in on-state (i.e. VG = Vdd). For a PMOS transistor,
Fig. 5.3(c) shows the gate and subthreshold leakage currents when VG = Vdd
(broken lines), and the gate leakage and Source-Drain currents when VG = 0,
(solid lines).
A number of observations can be made from these figures:
• Gate leakage currents exist in all transistors no matter if these are in
on- or off-state, as long as |VGS |> 0 and |VGD|> 0. If VG = VD = VS ,
there is however no gate leakage.
• When VG = Vdd, the gate leakage current is going into the transistor
through the Gate to either Drain or Source (or both) that have a
voltage potential less than Vdd.
• When VG = 0, gate leakage current is going out from the transistor
through the Gate from either Drain or Source (or both) that have a
voltage potential greater than VG = 0.
• A subthreshold leakage current exists only in those transistors that
are in off-state, and it goes from Drain to Source (NMOS) and from
Source to Drain (PMOS) for |VDS | > 0. If VD = VS , there is no
subthreshold leakage.
• A substrate leakage current exists in all transistors no matter if
these are in the on-state or in the off-state.
• The gate and substrate leakage currents are captured directly by
using probes i2 and i4, respectively, while the subthreshold leakage
current is captured by using either i1 or i3 probes depending on the
direction of the resulting gate leakage current. For example, in the
NMOS transistor shown in Fig. 5.3(b), the subthreshold leakage
current is captured by i1 for VG = Vdd, and by i3 for VG = 0.
The observations above have led to the following methodology to capture
leakage mechanisms using Hspice current probes for static CMOS circuits (re-
ferred to as the circuit in this section):
Capturing Total Leakage:
1. Following Kirchhoff’s current law for the circuit, the summation of all
in-going currents to the circuit must be equal to the summation of all out-
going currents from the circuit. The total leakage current in the circuit
is equal either to the summation of all in-going currents to the circuit or
to the summation of all out-going currents from the circuit. If several
interconnected circuits are analyzed separately and if their total leakage
power is summed up (e.g. to obtain the total leakage power of a system),
then total leakage power for all separately analyzed circuits should be
obtained in the same manner, either by adding all in-going currents or by
adding all out-going currents.
2. The in-going currents to the circuit refer to those currents that go from
the supply voltage source (Vdd) through PMOS transistors that have their
Sources directly connected to Vdd (denoted as \(M_{pmos}^{V_{dd}}\) in Eq. 5.1); and
those gate leakage currents that go into the circuit through the Gate of the
transistors having VG = Vdd (denoted as \(M_{V_G=V_{dd}}\) in Eq. 5.1).
3. The out-going currents from the circuit refer to those currents that go
to the ground (gnd) through NMOS transistors that have their Sources
directly connected to the ground (denoted as \(M_{nmos}^{gnd}\) in Eq. 5.2); and
those gate leakage currents that go out from the circuit through the Gate
of the transistors having VG = 0 (denoted as \(M_{V_G=0}\) in Eq. 5.2).
4. By using Hspice current probes, equations of the total in-going and out-
going currents for the circuit are created:

\[ I_{leak}^{in\text{-}going} = \sum_i \big[ i_3(M_{pmos}^{V_{dd}}) \big]_i + \sum_j \big[ i_2(M_{V_G=V_{dd}}) \big]_j + \sum_{mp} \big[ i_4(M_{pmos}) \big]_{mp} \quad (5.1) \]

\[ I_{leak}^{out\text{-}going} = \sum_k \big[ i_3(M_{nmos}^{gnd}) \big]_k + \sum_t \big[ i_2(M_{V_G=0}) \big]_t + \sum_{mn} \big[ i_4(M_{nmos}) \big]_{mn} \quad (5.2) \]
Here, i, j, k and t are the number of PMOS transistors that have their
Sources directly connected to Vdd, the number of the transistors that have
VG = Vdd, the number of NMOS transistors that have their Sources di-
rectly connected to the ground and the number of the transistors that have
VG = 0 inside the circuit, respectively. mn is the number of NMOS tran-
sistors, whereas mp is the number of PMOS transistors inside the circuit.
5. Eqs 5.1 and 5.2 are simplified by removing the current probes for those
transistors that have no gate and subthreshold leakage. Either Eq. 5.1 or
Eq. 5.2 represents the total leakage current of the circuit.
To Separate Leakage Mechanisms:
1. From Eq. 5.1 (or Eq. 5.2) the total substrate leakage is obtained by sum-
ming up all i4 probes for the PMOS (or NMOS) transistors of the circuit,
i.e. Eq. 5.3:
\[ I_{leak}^{substrate} = \sum_{mp} \big[ i_4(M_{pmos}) \big]_{mp} = \sum_{mn} \big[ i_4(M_{nmos}) \big]_{mn} \quad (5.3) \]
2. To capture the total subthreshold leakage current of the circuit, three steps
need to be carried out: (i) In the circuit, for all conduction paths of sub-
threshold leakage currents connecting Vdd to gnd nodes, find the bound-
ary nodes1; (ii) For each conduction path, if the transistor located below
the boundary node is a PMOS, use current probe i3, otherwise use i1
to obtain the subthreshold leakage current for that path; (iii) The total
subthreshold leakage current of the circuit is the summation of currents
obtained in all conduction paths.
1The intermediate connection points between those transistors that have VG = Vdd and those
that have VG = 0.
3. The total gate leakage current of the circuit is obtained by subtracting the
total subthreshold leakage current from the total in-going (or out-going)
leakage current of the circuit.
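The separation steps above can be sketched as follows, following the text literally (gate leakage as the difference between the total and the subthreshold component). The numeric values are invented placeholders, not characterized currents:

```python
# Sketch of the leakage separation: Eq. 5.3 gives the substrate component
# directly; step 3 obtains the gate component by subtraction. All numbers
# below are invented placeholders for probe-derived currents (amperes).

def separate_leakage(total_leak, subthreshold_leak, substrate_leak):
    """Return (subthreshold, gate, substrate) components of the total leakage."""
    gate_leak = total_leak - subthreshold_leak   # step 3 of the separation
    return subthreshold_leak, gate_leak, substrate_leak

sub, gate, substrate = separate_leakage(total_leak=23.5e-9,
                                        subthreshold_leak=20.0e-9,
                                        substrate_leak=0.5e-9)
```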
For each cache component, the probing methodology is applied to capture
not only the total leakage power, but also other leakage components, i.e. the
gate, subthreshold and substrate leakage. In the next section, the detailed prob-
ing strategy for memory cells is shown. For other cache components, probing
schemes are obtained in a similar manner.
5.4 Power Models for On-Chip Caches
In this section, the characterization phase for each cache component is shown
and the obtained power models are described in detail.
5.4.1 Power Models for Partitioned Data SRAM Arrays
Organization Parameters
The assumed organization parameters for partitioned SRAM arrays are defined
in Table 5.1. As mentioned in Section 2.4.2, Wada et al. [9] showed how the
array can be split horizontally and vertically using Ndwl and Ndbl. Increasing
Ndwl and Ndbl, thus, yields shorter wordlines and bitlines, respectively, which
decreases the array access time, but increases the memory footprint. Increasing
Ndbl also increases the number of precharge circuits required, while increasing
Ndwl introduces a need for more wordline drivers. The parameter Nout, to-
gether with the number of available sense amplifiers (NSA) and write circuits
(NWRC ), defines the multiplexing ratio and the size of multiplexors required.
Increasing Nout would result in an increase of the required NSA and NWRC ,
or in an increase of the multiplexor size.
In general, partitioning with a large number of sub-arrays incurs a signif-
icant area overhead due to extra internal control logic. Clearly, determining
Figure 5.4: Block diagram of a partitioned SRAM array using DWL and DBL techniques
Table 5.1: Organization parameters for partitioned SRAM arrays

Symbol              Meaning                                    Parameters
N_addr              Address width in bits                      N_addr = N^{rowdec}_{addr} + N^{coldec}_{addr}
N^{rowdec}_{addr}   Number of address bits to row decoder      integer (i.e. 1, 2, 3, 4, ...)
N^{coldec}_{addr}   Number of address bits to column decoder   integer (i.e. 1, 2, 3, 4, ...)
N_out               Output width in bits                       integer, multiple of 8 (i.e. 8, 16, 32, ...)
N_dwl               Number of segments per wordline            1, 2, 4, 8, ...
N_dbl               Number of segments per bitline             1, 2, 4, 8, ...
N_sub-arrays        Total number of sub-arrays                 N_sub-arrays = N_dwl × N_dbl
N_rows              Number of rows                             N_rows = 2^{N^{rowdec}_{addr}}
N_words             Number of addressable words                N_words = 2^{N^{coldec}_{addr}}
N_wlength           Word length                                integer, multiple of 8 (= 8 in this thesis)
N_columns           Number of columns                          N_columns = N_words × N_wlength
sub-array organization is about striking a good balance between the energy savings
and access-time reduction on the one hand, and the overhead of supporting them
on the other.
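The derived quantities in Table 5.1 can be sketched directly. The parameter values below are illustrative; they happen to resemble an 8-KB configuration with Ndwl = 4, Ndbl = 16, but are not taken from the thesis tables:

```python
# Sketch of the derived organization parameters of Table 5.1 for a
# partitioned array. The chosen parameter values are illustrative only.

def array_organization(n_rowdec_addr, n_coldec_addr, ndwl, ndbl, nwlength=8):
    nrows = 2 ** n_rowdec_addr       # Nrows = 2^(Nrowdec_addr)
    nwords = 2 ** n_coldec_addr      # Nwords = 2^(Ncoldec_addr)
    return {
        "Naddr": n_rowdec_addr + n_coldec_addr,
        "Nrows": nrows,
        "Nwords": nwords,
        "Ncolumns": nwords * nwlength,   # Ncolumns = Nwords x Nwlength
        "Nsub_arrays": ndwl * ndbl,      # Nsub-arrays = Ndwl x Ndbl
    }

org = array_organization(n_rowdec_addr=8, n_coldec_addr=5, ndwl=4, ndbl=16)
```

With these inputs the array has 256 rows and 256 columns (256 × 256 bits = 8 KB) split into 64 sub-arrays.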
Fig. 5.4 shows the block diagram of a partitioned SRAM array using DWL
and DBL techniques; the original array is divided into Ndwl ×Ndbl sub-arrays.
Each sub-array takes as inputs global wordlines (WL) from global WL drivers,
global bitlines (BL), and several control signals from the internal control logic,
e.g. local BL precharge (LBL_PRE), local WL selection (LWL_SEL) and
local BL selection (LBL_SEL) signals. Inside each sub-array (Fig. 5.5), global
WL is AND-ed with LWL_SEL to create local WLs, and local BLs are con-
nected to global BLs through pass transistors controlled by LBL_SEL signals.
For each local WL there is a local WL driver used to drive the WL se-
lection signal to the memory cells. Each local BL has a local precharge circuit
controlled by LBL_PRE. As it is straightforward to implement, the static
pull-up BL precharging scheme is widely used in partitioned SRAM arrays and
caches [10]—this is assumed for the partitioned array configuration used in this
section.
Figure 5.5: Organization of a sub-array
Power Models for Partitioned Data SRAM Arrays
The power models for SRAM memory components are summarized in the fol-
lowing equations:
Total Power_array = Σ_i (P_dyn + P_leak)_i   (5.4)

where i is the index over the components of a SRAM array, including memory cells,
SA, WRC, wordline drivers, decoders, multiplexers and column isolation logic.
With reference to Fig. 5.5, a read or a write is preceded by a precharge
of the selected LBL/LBL to Vdd, and the selection of local row/column by
row/column decoders based on a given address. A local wordline and a local
bitline (or a set of them) are selected by using LWL_SEL and LBL_SEL
signals to read or write memory cell(s). During read, column isolation PMOS
transistors are turned ON to allow the voltage difference between the selected
LBL/LBL (connected to sense amplifiers through GBL/GBL) to develop
to the sensing voltage (Vsense), after which they are turned OFF to isolate
LBL/LBL from sense amplifiers, helping the amplifiers to quickly sense the
data stored in the cells that are accessed. Multiplexing NMOS transistors
(MUXes) are used to connect write circuits to the selected pairs of LBL/LBL
during write cycles only; the write circuits are idle (leaking) during read cycles.
Local bitline precharging uses a static pull-up scheme that leaves the precharge
transistors on all the time [10]; precharging turns OFF only in the evaluation phase
of read/write cycles. Bitline precharge time is designed to be partially hid-
den under the address decoding time, to reduce the power dissipated by the
precharge buffers, while still achieving a short read/write time. Drivers of the
write circuits are designed to be powerful enough, so that they can pull down
precharged LBL/LBL (connected to the write circuits through GBL/GBL) to
zero fast. Sense amplifiers (SA) are designed to have Vsense= 200 mV and the
bitlines surrounding the SA are always precharged to Vdd before turning ON
isolation transistors and the wordline for a read. The architectural selection of
SA and the precharging scheme was motivated by the fact that this type of SA
dissipates less short-circuit power than one that requires precharging to V dd2 .
Memory cells:
In partitioned arrays, the dynamic power of a read operation is due to LBL/
LBL and GBL/GBL discharging currents through the accessed cell, while
write dynamic power is due to discharging currents through the write circuits.
The “passive read” dynamic power is due to LBL/LBL and GBL/GBL dis-
charging currents through the opened pass transistors into cells, which share the
same local wordline with the selected cell, while GBL/GBL are disconnected
from all SAs and write circuits (WRC). The number of “passive read” cells can
be defined as N^{pass.read}_{mcells} = N_columns/N_dwl − N_wlength, which decreases with
increasing N_dwl. Hence, the “passive read” dynamic power is lower in partitioned
arrays than in the unpartitioned array [11].

Figure 5.6: (a) Characterization of a 6T-SRAM cell, (b) Hspice configuration for V_LBL estimation
Characterization of a memory cell is done by performing a circuit-level DC
simulation for a single cell connected to a pair of LBL and LBL to quantify
all leakage components (Fig. 5.6a). The dynamic power dissipation can be
accurately estimated using Eq. 5.6, given the global and local bitline capacitances
C_GBL, C_LBL, and the global and local bitline voltage swings ∆V_GBL, ∆V_LBL. The
“passive read” power is estimated by using Eq. 5.7, where the “passive read” global
and local bitline voltage swings ∆V^{pass.read}_GBL and ∆V^{pass.read}_LBL are obtained
using Eqs 5.10 and 5.11.
P^{mcells}_{dyn} = N_wlength P^{mcell}_{active} + N^{pass.read}_{mcells} P^{mcell}_{pass.read}   (5.5)

P^{mcell}_{active} = V_dd f_clk (C_GBL ∆V_GBL + C_LBL ∆V_LBL)   (5.6)

P^{mcell}_{pass.read} = V_dd f_clk C_GBL ∆V^{pass.read}_GBL + V_dd f_clk C_LBL ∆V^{pass.read}_LBL   (5.7)

C_GBL = N_dbl C^{nmos_pass}_{drain} + C^{mux}_{source} + C^{iso}_{drain} + C^{GBL}_{wire}   (5.8)

C_LBL = (N_rows / N_dbl) C_mcell + C^{nmos_pass}_{source} + 2 C^{pmos_prech}_{drain} + C^{LBL}_{wire}   (5.9)

∆V^{pass.read}_LBL = ∆T_wordline · I_{LBL_discharge} / C_LBL   (5.10)

∆V^{pass.read}_GBL = ‖∆V^{pass.read}_LBL − (V_dd − V^{initial}_GBL)‖   (5.11)

P^{mcells}_{leak} = N_mcells I^{mcell}_{leak} V_dd   (5.12)
Here, C_mcell, C^{nmos_pass}_{drain}, C^{nmos_pass}_{source}, C^{mux}_{source}, C^{iso}_{drain},
C^{pmos_prech}_{drain}, C^{GBL}_{wire} and C^{LBL}_{wire} are a cell’s load capacitance onto the
bitline, the drain and source capacitances of an NMOS transistor connecting local to
global bitlines, the source capacitance of a MUX NMOS transistor, the drain capacitance
of an ISO PMOS transistor, the drain capacitance of a precharge PMOS transistor, and
the global and local bitline wire capacitances, respectively. ∆T_wordline is the time
during which a wordline is on, and I_{LBL_discharge} is the local bitline discharging
current, which can be obtained by running a circuit-level DC simulation for a stack
of the two NMOS transistors (from the SRAM cell) connected between a local
bitline (precharged to V_dd) and ground (see Fig. 5.6b). V^{initial}_GBL is the initial
voltage level of the non-selected global bitline when the evaluation cycle starts.
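Eqs 5.5–5.7 and 5.10 can be sketched as below. The supply, clock, capacitance and swing values are invented placeholders, not characterized values from the thesis:

```python
# Sketch of Eqs 5.5-5.7 and 5.10 for the memory-cell dynamic power.
# All capacitances, currents and timing values are invented placeholders.

VDD, FCLK = 1.2, 1e9                       # supply (V) and clock (Hz)

def p_active(cgbl, clbl, dv_gbl, dv_lbl):
    """Eq. 5.6: dynamic power of one actively accessed cell."""
    return VDD * FCLK * (cgbl * dv_gbl + clbl * dv_lbl)

def dv_lbl_passive(dt_wordline, i_lbl_discharge, clbl):
    """Eq. 5.10: 'passive read' local bitline voltage swing."""
    return dt_wordline * i_lbl_discharge / clbl

def p_mcells_dyn(nwlength, n_passive, p_act, p_pass):
    """Eq. 5.5: total dynamic power of the memory cells."""
    return nwlength * p_act + n_passive * p_pass

pa = p_active(cgbl=100e-15, clbl=50e-15, dv_gbl=0.3, dv_lbl=0.3)
dv = dv_lbl_passive(dt_wordline=1e-9, i_lbl_discharge=5e-6, clbl=50e-15)
```

For these placeholder values, pa is 1.2 V × 1 GHz × 45 fF·V = 54 µW per active cell, and the passive local-bitline swing evaluates to 0.1 V.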
Fig. 5.7 shows the subthreshold and gate leakage currents for a partitioned
6T-SRAM cell. The total leakage power of the memory cells is estimated using
Eq. 5.12, where N_mcells = 2^{N_addr} is the number of memory cells and
I^{mcell}_{leak} is the total leakage current of a single memory cell, obtained by
using the methodology given in Section 5.3 and defined by either Eq. 5.13 or
Eq. 5.14:

I^{mcell}_{leak} = i1(PT1) + i1(PT2) + i3(P1) + i3(P2) + i4(P1) + i4(P2)   (5.13)
Figure 5.7: Subthreshold (green, solid) and gate leakage (red, dotted) currents in a partitioned 6T-SRAM cell
I^{mcell}_{leak} = i2(PT1) + i2(PT2) + i2(PT3) + i2(PT4) + i3(N1) + i3(N2) + i4(N1) + i4(N2)
               + i4(PT1) + i4(PT2) + i4(PT3) + i4(PT4)   (5.14)
Sense Amplifier:
Power dissipation of a sense amplifier (SA) consists of leakage and dynamic
components. Dynamic power is due to the current that discharges bitlines of the
SA (referred to as BLSA/BLSA) from (Vdd − Vsense) to zero, which can be
estimated using Eq. 5.15, given bitline SA capacitance CBLSA, fclk, bitline SA
voltage swing ∆VBLSA = Vdd−Vsense and Vdd. Leakage power is obtained by
running a circuit-level DC simulation with the appropriate probes for a single
SA with the configuration for characterization shown in Fig. 5.8a. Here, N_SA is
the number of sense amplifiers used in the array, and C_SA, C^{iso}_{source} and
C^{GBL}_{wire} are the capacitance of a SA, the source capacitance of a PMOS
isolation transistor and the GBL wire capacitance, respectively.
P^{SAs}_{dyn} = V_dd f_clk C_BLSA (V_dd − V_sense)   (5.15)
Figure 5.8: Characterization of (a) a sense amplifier, (b) a write circuit
P^{SAs}_{leak} = N_SA I^{SA}_{leak} V_dd   (5.16)

C_BLSA = C_SA + N_words C^{iso}_{source} + C^{GBL}_{wire}   (5.17)
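Eqs 5.15–5.17 can be sketched as below; the capacitance and current values are invented for illustration:

```python
# Sketch of the sense-amplifier power models (Eqs 5.15-5.17) with invented
# capacitance and current values.

def c_blsa(c_sa, nwords, c_iso_source, c_gbl_wire):
    """Eq. 5.17: bitline capacitance seen by one SA."""
    return c_sa + nwords * c_iso_source + c_gbl_wire

def p_sas_dyn(vdd, fclk, cblsa, vsense):
    """Eq. 5.15: dynamic power of the SA bitline discharge."""
    return vdd * fclk * cblsa * (vdd - vsense)

def p_sas_leak(n_sa, i_sa_leak, vdd):
    """Eq. 5.16: leakage power of all SAs."""
    return n_sa * i_sa_leak * vdd

c = c_blsa(c_sa=50e-15, nwords=32, c_iso_source=2e-15, c_gbl_wire=86e-15)
p_dyn = p_sas_dyn(vdd=1.2, fclk=1e9, cblsa=c, vsense=0.2)
```

With V_sense = 200 mV (as assumed in this chapter) the bitline swing is 1.0 V, giving 240 µW for the placeholder 200 fF load.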
Writing logic:
The write circuit (WRC) dissipates dynamic power through the current
that discharges the selected, precharged LBL/LBL (connected to the WRC
through a pair of selected GBL/GBL) from V_dd to zero while driving a zero or
a one into the selected cell (Fig. 5.8b). This power dissipation is estimated using
Eqs 5.18 – 5.21 for given ∆V^{write}_GBL, ∆V^{write}_LBL, f_clk, V_dd, C_GBL, C_LBL, and
C_BLWRC, which is calculated using Eq. 5.22. The leakage power is obtained
using Eq. 5.23, where I^{WRC}_{leak} is the leakage current obtained by characterizing
a single write circuit with the probing methodology described in Section 5.3.
P^{WRCs}_{dyn} = N_WRC (P^{BLWRC}_{dyn} + P^{GBL}_{dyn} + P^{LBL}_{dyn})   (5.18)

P^{BLWRC}_{dyn} = V^2_dd f_clk C_BLWRC   (5.19)
P^{GBL}_{dyn} = V_dd f_clk ∆V^{write}_GBL C_GBL   (5.20)

P^{LBL}_{dyn} = V_dd f_clk ∆V^{write}_LBL C_LBL   (5.21)

C_BLWRC = C_WRC + N_words C^{mux}_{source} + C^{mux}_{gate} + C^{GBL}_{wire}   (5.22)

P^{WRCs}_{leak} = N_WRC I^{WRC}_{leak} V_dd   (5.23)
Here, C_BLWRC is the bitline capacitance of a write circuit, C^{mux}_{gate} is the gate
capacitance of a MUX NMOS transistor, and ∆V^{write}_GBL, ∆V^{write}_LBL are the
voltage swings of the global and local bitlines in write cycles, respectively.
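Eqs 5.18–5.21 and 5.23 can be sketched as below; all values are illustrative placeholders:

```python
# Sketch of the write-circuit power (Eqs 5.18-5.21 and 5.23); all values
# are illustrative placeholders, not characterized quantities.

VDD, FCLK = 1.2, 1e9

def p_wrcs_dyn(n_wrc, c_blwrc, dv_gbl_w, c_gbl, dv_lbl_w, c_lbl):
    """Eq. 5.18, expanded with Eqs 5.19-5.21."""
    p_blwrc = VDD ** 2 * FCLK * c_blwrc        # Eq. 5.19
    p_gbl = VDD * FCLK * dv_gbl_w * c_gbl      # Eq. 5.20
    p_lbl = VDD * FCLK * dv_lbl_w * c_lbl      # Eq. 5.21
    return n_wrc * (p_blwrc + p_gbl + p_lbl)   # Eq. 5.18

def p_wrcs_leak(n_wrc, i_wrc_leak):
    """Eq. 5.23."""
    return n_wrc * i_wrc_leak * VDD

# Full-swing write (bitlines pulled from Vdd to zero) for 8 write circuits:
p = p_wrcs_dyn(n_wrc=8, c_blwrc=100e-15, dv_gbl_w=1.2, c_gbl=100e-15,
               dv_lbl_w=1.2, c_lbl=50e-15)
```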
Global/Local wordline drivers:
There are 2^{N^{rowdec}_{addr}} global wordline drivers and N_dwl × 2^{N^{rowdec}_{addr}}
local wordline drivers for a given row decoder with N^{rowdec}_{addr} memory address
bits. However, only one global and one local driver are active in each read/write
cycle, while the rest are idle and leaking. Although the total number of wordline
drivers for a partitioned array is increased (with respect to the unpartitioned array),
the size of each global wordline driver is smaller due to smaller driving capaci-
tances. The dynamic power of the global and local wordline drivers is estimated
using Eqs 5.24 – 5.27 for a given input capacitance of an AND gate C^{AND}_{gate},
the gate capacitance of the cell’s NMOS pass transistor C^{nmos_pass}_{gate}, the output
capacitances of the GWL/LWL drivers C^{GWLDrv}_{out}, C^{LWLDrv}_{out}, and the
GWL/LWL wire capacitances C^{GWL}_{wire}, C^{LWL}_{wire}, respectively.
P^{GwlDrv}_{dyn} = V^2_dd f_clk C_GWL   (5.24)

P^{LwlDrv}_{dyn} = V^2_dd f_clk C_LWL   (5.25)

C_GWL = N_dwl C^{AND}_{gate} + C^{GWLDrv}_{out} + C^{GWL}_{wire}   (5.26)

C_LWL = 2 (N_columns / N_dwl) C^{nmos_pass}_{gate} + C^{LWLDrv}_{out} + C^{LWL}_{wire}   (5.27)

P^{GwlDrv}_{leak} = N_rows I^{GwlDrv}_{leak} V_dd   (5.28)

P^{LwlDrv}_{leak} = N_dwl N_rows I^{LwlDrv}_{leak} V_dd   (5.29)
A circuit-level DC simulation for a single global/local wordline driver estab-
lishes the leakage power components for Eqs 5.28 and 5.29, using the probing
methodology described in Section 5.3.
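Eqs 5.24–5.29 can be sketched as below with invented capacitance and current values:

```python
# Sketch of the wordline-driver models (Eqs 5.24-5.29); all capacitance and
# current values are invented placeholders.

VDD, FCLK = 1.2, 1e9

def c_gwl(ndwl, c_and_gate, c_drv_out, c_wire):
    """Eq. 5.26: global wordline load (one AND gate per sub-array column)."""
    return ndwl * c_and_gate + c_drv_out + c_wire

def c_lwl(ncolumns, ndwl, c_pass_gate, c_drv_out, c_wire):
    """Eq. 5.27: local wordline load (two pass-transistor gates per cell)."""
    return 2 * (ncolumns / ndwl) * c_pass_gate + c_drv_out + c_wire

def p_dyn(c_load):
    """Eqs 5.24/5.25: dynamic power of one driver."""
    return VDD ** 2 * FCLK * c_load

def p_leak_gwl(nrows, i_leak):            # Eq. 5.28
    return nrows * i_leak * VDD

def p_leak_lwl(ndwl, nrows, i_leak):      # Eq. 5.29
    return ndwl * nrows * i_leak * VDD

cg = c_gwl(ndwl=4, c_and_gate=2e-15, c_drv_out=5e-15, c_wire=87e-15)
```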
Address Decoders:
Fig. 5.9 shows the architecture of a row/column decoder used in this thesis.
This architecture is similar to the one used in CACTI [12] for cache delay esti-
mation. For a given number of address bits N_addr, the numbers of 3to8 decoders
(N_3to8) and 2to4 decoders (N_2to4), the number of NOR gates (N_nor), the
number of inverters (N_inv) and the number of wordline drivers (N_wlDrv) required
for the implementation of the row decoder (rowdec) and the column decoder
(coldec) are given in Eqs 5.30, 5.32 and 5.33.
N_addr = 3 N_3to8 + 2 N_2to4   (5.30)

N_addr = N^{rowdec}_{addr} + N^{coldec}_{addr}   (5.31)

N_rows = 2^{N^{rowdec}_{addr}} = N^{rowdec}_{nor} = N^{rowdec}_{inv} = N_wlDrv   (5.32)

N_words = 2^{N^{coldec}_{addr}} = N^{coldec}_{nor} = 0.5 N^{coldec}_{inv}   (5.33)
Here, recall from Table 5.1 that N^{rowdec}_{addr} and N^{coldec}_{addr} are the numbers
of row and column address bits, respectively, and N_words is the number of ad-
dressable words in this memory array.
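Eq. 5.30 can be sketched as a small decomposition. Preferring the larger 3to8 blocks is a plausible policy assumed here for illustration, not one mandated by the thesis:

```python
# Sketch of Eq. 5.30: splitting Naddr address bits into 3to8 and 2to4
# predecoders, preferring the larger 3to8 blocks (an assumed policy).

def decoder_split(naddr):
    """Return (N3to8, N2to4) such that 3*N3to8 + 2*N2to4 == naddr."""
    for n3to8 in range(naddr // 3, -1, -1):
        remainder = naddr - 3 * n3to8
        if remainder % 2 == 0:
            return n3to8, remainder // 2
    raise ValueError(f"cannot decompose {naddr} address bits")

# An 8-bit row address (Nrows = 2^8 = 256, as in the 8-256 decoder of Fig. 5.9):
n3, n2 = decoder_split(8)
```

For 8 address bits this yields two 3to8 decoders and one 2to4 decoder, matching the structure shown in Fig. 5.9.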
Each 3to8 and 2to4 decoder is typically implemented using NAND gates
and inverters to complement the address inputs. During each read/write cy-
cle, the decoder-enable signal DecSel triggers the decoder’s outputs. Each
NOR gate collects an output from every decoder and then, together with an
inverter and a wordline driver, forms a wordline activation signal. Since the
0→1 nodal transition is considered to be the power-consuming one, all NAND,
NOR and inverter gates are active when making 0→1 transitions, and are inactive
and leaking otherwise. The leakage power is obtained by running a circuit-level
DC simulation with appropriate probes for each row or column decoder with
Figure 5.9: Architecture of an 8-256 row decoder
no 0→1 transitions in the addresses. A probabilistic method is used to esti-
mate the dynamic power dissipation of the row and column decoders. Based on
the method described in [13], a transition activity factor α_{0→1} can be calculated
for each node, assuming that all addresses to the decoders are equally probable
and that DecSel is turned ON in every read/write cycle. The dynamic and leak-
age power dissipation of a row/column decoder are calculated by Eq. 5.34 and
Eq. 5.35, respectively.
P^{dec}_{dyn} = P^{dec3to8}_{dyn} + P^{dec2to4}_{dyn} + P^{nor}_{dyn} + P^{inv}_{dyn}   (5.34)

P^{dec}_{leak} = P^{dec3to8}_{leak} + P^{dec2to4}_{leak} + P^{nor}_{leak} + P^{inv}_{leak}   (5.35)

where,

P^{dec3to8}_{dyn} = V^2_dd f_clk (N_inv α_inv C_inv + N_out α_out C_out)_{dec3to8}
P^{dec2to4}_{dyn} = V^2_dd f_clk (N_inv α_inv C_inv + N_out α_out C_out)_{dec2to4}
P^{nor}_{dyn} + P^{inv}_{dyn} = V^2_dd f_clk α_nor (C_nor + C_inv)
P^{dec3to8}_{leak} = P^{dec3to8}_{leak_inv} + P^{dec3to8}_{leak_nand}
P^{dec2to4}_{leak} = P^{dec2to4}_{leak_inv} + P^{dec2to4}_{leak_nand}
P^{dec3to8}_{leak_nand} = 8 (1 − α^{dec3to8}_{out}) I^{dec3to8}_{leak_nand} V_dd
P^{dec2to4}_{leak_nand} = 4 (1 − α^{dec2to4}_{out}) I^{dec2to4}_{leak_nand} V_dd
P^{dec3to8}_{leak_inv} = 3 (1 − α^{dec3to8}_{inv}) I^{dec3to8}_{leak_inv} V_dd
P^{dec2to4}_{leak_inv} = 2 (1 − α^{dec2to4}_{inv}) I^{dec2to4}_{leak_inv} V_dd
P^{nor}_{leak} + P^{inv}_{leak} = (N − 1) (I^{nor}_{leak} + I^{inv}_{leak}) V_dd
Here, α_inv, α_out, α_nor are the ’0→1’ transition activity factors for the address
inverters, the NAND gates (inside the 3to8 and 2to4 decoders), and the NOR gates,
respectively. For equally probable address inputs to the decoders, α_inv = 0.25,
α^{dec3to8}_{out} = 0.1094, α^{dec2to4}_{out} = 0.1875, and α_nor = 1/N, where
N = N_rows and N = N_words for the row and column decoders, respectively.
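Parts of the decoder model above can be sketched as follows, using the quoted activity factors; the capacitance and leakage-current values are invented placeholders:

```python
# Sketch of parts of the decoder power model (Eqs 5.34-5.35) using the
# activity factors quoted above; capacitances and currents are invented.

VDD, FCLK = 1.2, 1e9

def p_nor_inv_dyn(n, c_nor, c_inv):
    """Dynamic power of the NOR+inverter stage, with alpha_nor = 1/N."""
    alpha_nor = 1.0 / n
    return VDD ** 2 * FCLK * alpha_nor * (c_nor + c_inv)

def p_dec3to8_leak(i_leak_nand, i_leak_inv,
                   alpha_out=0.1094, alpha_inv=0.25):
    """Leakage of one 3to8 decoder: 8 NAND gates plus 3 address inverters."""
    return (8 * (1 - alpha_out) * i_leak_nand
            + 3 * (1 - alpha_inv) * i_leak_inv) * VDD

# Row decoder of a 256-row array (N = Nrows = 256):
p_nor = p_nor_inv_dyn(n=256, c_nor=10e-15, c_inv=10e-15)
```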
MUX and Isolation logic:
There are 2 × N_columns NMOS and PMOS transistors used for multiplex-
ing WRCs and SAs to GBL/GBL, respectively. During a read cycle, the PMOS
isolation transistors are turned ON while the NMOS transistors are turned OFF,
and during a write cycle the PMOS transistors are OFF while the NMOS are ON.
The number of idle NMOS (N_mux) and PMOS (N_iso) transistors is inversely
proportional to N_wlength, i.e. a longer access word length leads to fewer idle
MUX and isolation transistors. In the model verification part, since
N_wlength = 8 bits, there are 2 × (N_columns − N_wlength) = 496 idle transistors
for the 8-KB array and 240 for the 2-KB array, which contribute significantly to
the leakage power. The leakage power in the MUXes is due to the leakage currents
to the substrate and through the gate. By running a circuit-level DC simulation for
an off-state NMOS and an off-state PMOS connected between V_dd and ground,
the leakage currents for those off-state transistors are captured.
P^{mux}_{dyn} + P^{iso}_{dyn} = N_mux I^{nmos}_{dyn} + N_iso I^{pmos}_{dyn}   (5.36)

P^{mux}_{leak} + P^{iso}_{leak} = N_mux I^{nmos}_{leak} + N_iso I^{pmos}_{leak}   (5.37)
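The idle-transistor counts quoted above can be checked directly. The column counts (256 for the 8-KB array, 128 for the 2-KB array) are inferred from the quoted totals, not stated explicitly in this passage:

```python
# Check of the idle-transistor count quoted above:
# Nidle = 2 x (Ncolumns - Nwlength), with Nwlength = 8 bits.
# The Ncolumns values below are inferred from the quoted 496/240 totals.

def idle_mux_iso_transistors(ncolumns, nwlength=8):
    return 2 * (ncolumns - nwlength)

n_idle_8kb = idle_mux_iso_transistors(256)   # 8-KB array
n_idle_2kb = idle_mux_iso_transistors(128)   # 2-KB array
```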
5.4.2 Power Models for Unpartitioned Data SRAM Arrays
Compared to a physically partitioned SRAM array of the same size, an unpar-
titioned array has a simpler organization. There are no global bitlines or global
wordlines, and neither global wordline drivers nor extra control circuits for sub-
array selection. Therefore, power modeling for unpartitioned SRAM arrays is
straightforward and simpler than for partitioned arrays. The component power
models of a partitioned array can be reused directly for some components of an
unpartitioned array, such as the SA, the row/column decoders, and the
MUX/isolation logic. For the remaining components, some modifications to
their power models are required. Eqs 5.38 – 5.46 show the obtained power
models for memory cells, WRC and wordline drivers, respectively.
Memory cells:

P^{mcells}_{dyn} = V_dd f_clk C_BL (∆V_BL + ∆V^{pass.read}_BL)   (5.38)

C_BL = N_rows C_mcell + 2 C^{pmos_prech}_{drain} + C^{mux}_{source} + C^{iso}_{drain} + C^{BL}_{wire}   (5.39)

P^{mcells}_{leak} = N_mcells I^{mcell}_{leak} V_dd   (5.40)

Writing logic:

P^{WRCs}_{dyn} = V^2_dd f_clk C_BLWRC   (5.41)

C_BLWRC = C_WRC + N_words C^{mux}_{source} + C^{mux}_{gate} + C^{BL}_{wire}   (5.42)

P^{WRCs}_{leak} = N_WRC I^{WRC}_{leak} V_dd   (5.43)

Wordline drivers:

P^{wlDrv}_{dyn} = V^2_dd f_clk C_WL   (5.44)

C_WL = 2 N_columns C^{nmos_pass}_{gate} + C^{WLDrv}_{out} + C^{WL}_{wire}   (5.45)

P^{wlDrv}_{leak} = N_rows I^{wlDrv}_{leak} V_dd   (5.46)
Here, ∆V_BL is the bitline voltage swing and ∆V^{pass.read}_BL is the “passive read”
bitline voltage swing. C^{BL}_{wire}, C^{WL}_{wire}, C^{WLDrv}_{out} and C^{iso}_{drain}
are the bitline wire capacitance, the wordline wire capacitance, the output
capacitance of a wordline driver, and the drain capacitance of an ISO PMOS
transistor, respectively.
5.4.3 Power Models for SRAM-based Tag Arrays
Tag Array Organization Parameters
Table 5.2 shows the assumed organization parameters for the partitioned SRAM-
based tag arrays. The size of the tag field N_tag is calculated using the following
equation:

N_tag = N_mem_addr − N_index + log2(A) − N_Byte_offset − N_Word_block   (5.47)
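Eq. 5.47 can be sketched as below; the example address layout (32-bit address, 2-way associative) is invented for illustration, not a configuration from the thesis:

```python
# Sketch of Eq. 5.47 for the tag-field width. The example address layout is
# an invented illustration, not a thesis configuration.
import math

def tag_width(n_mem_addr, n_index, assoc, n_byte_offset, n_word_block):
    """Ntag = Nmem_addr - Nindex + log2(A) - NByte_offset - NWord_block."""
    return (n_mem_addr - n_index + int(math.log2(assoc))
            - n_byte_offset - n_word_block)

ntag = tag_width(n_mem_addr=32, n_index=8, assoc=2,
                 n_byte_offset=2, n_word_block=3)
```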
In order to reduce the total power in a tag array, physical partitioning tech-
niques have also been used. Physically partitioned tag arrays are usually parti-
tioned vertically using the DBL technique [14], but not horizontally, since they
are often designed to read a complete tag-line at a time, as fast as possible.
A horizontally partitioned tag array may require several clock cycles to read
a complete tag-line and thus slows down the cache. Therefore, in
this thesis only vertically partitioned tag arrays are considered, i.e. N_twl = 1 is
assumed to be constant.
Power Models for Partitioned SRAM-based Tag Arrays
Fig. 2.11 in Section 2.4.1 shows the organization of a typical SRAM-based
cache, used as the basic organization assumed for power modeling throughout
this work. It is clear from this figure that the tag array has two additional com-
ponents compared to the data array: the comparators and the MUX drivers.
However, the MUX drivers dissipate insignificant power compared to the
comparators [12], and therefore they are not considered in this work.
Table 5.2: Organization parameters for partitioned SRAM-based tag arrays

Symbol                Meaning                                 Parameters
C                     Cache size in Bytes
B                     Block size in Bytes
A                     Associativity                           integer (i.e. 1, 2, 3, 4, ...)
N_twl                 Number of segments per tag wordline     1, 2, 4, 8, ...
N_tbl                 Number of segments per tag bitline      1, 2, 4, 8, ...
N^{tag}_{sub-arrays}  Total number of tag sub-arrays          N^{tag}_{sub-arrays} = N_twl × N_tbl
N_tag                 Size of tag field in bits               integer (i.e. 1, 2, 3, 4, ...)
N_mem_addr            Memory address width in bits            integer (i.e. 1, 2, 3, 4, ...)
N_index               Index width in bits                     integer (i.e. 1, 2, 3, 4, ...)
N_Byte_offset         Byte offset in bits                     integer (i.e. 1, 2, 3, 4, ...)
N_Word_block          Word offset in bits                     integer (i.e. 1, 2, 3, 4, ...)
Comparator:
Fig. 5.10 shows the structure of a typical NOR-based comparator assumed
for power modeling in this section. This architecture is similar to the one used
in CACTI [12]. The outputs from the tag SAs of the tag array are connected to
the inputs labeled a_n (and their complements), while the b_n inputs (and their
complements) are driven by the tag bits in the address (also referred to as Search
Lines, SLs, in Fig. 2.11). Here, the index n = 0, 1, 2, ..., N_tag. The node OUTcmp
is the output of the comparator, which is connected to the input of a match-line
sense amplifier (MLSA). The output of the MLSA is the match result, denoted ML.
The node EVAL is used as a “virtual ground” for the pull-down paths of the
comparator. The working principle of a NOR-based comparator consists of three
phases [15]:
1. SL precharge: precharge the search lines (b_n and their complements) to low
Figure 5.10: The structure of a typical N_tag-bit NOR-based comparator
2. Match-line precharge: precharge the OUTcmp to high by turning ON the
precharge PMOS transistor
3. Match-line evaluation: (i) turn OFF the precharge PMOS transistor;
(ii) drive the SLs (b_n and their complements) to the tag bits in the address;
(iii) drive the a_n inputs (and their complements) to the outputs from the tag
SAs; (iv) perform the comparison and drive OUTcmp to the MLSA, which in
turn generates a match result based on the voltage level it senses.
In the match-line evaluation phase, to ensure that the output OUTcmp is
not discharged before the a_n bits become stable, the node EVAL is held high un-
til roughly three inverter delays after the generation of the a_n signals. This is
accomplished by using a timing chain driven by a tag SA in the tag array. The
output of this timing chain is connected to EVAL [12]. For simplicity of power
modeling, the SL precharge and match-line precharge phases are combined into
one, denoted the comparator precharge phase.
Applying the methodology given in Section 5.3 to the comparator cir-
cuit, some observations can be made:
• For each pair of the comparing bits an and bn there are two pull-down
paths: the first consists of two NMOS transistors M1 and M4, and the
second – M2 and M3 (see Fig. 5.10).
• During the comparator precharge phase, the nodes EVAL and OUTcmp
are high, so there is no subthreshold leakage in the comparator circuit.
However, there are gate leakage currents going out of those MOS tran-
sistors that have VG = 0. For example, the precharge PMOS transistor
Mpre has a gate leakage current running from its source (with VS = V_dd)
to its gate, which is captured by the probe i2(Mpre). This leakage
current, however, turns out to be negligible.
• In the match-line evaluation phase, i.e. when Mpre is OFF and V_EVAL = 0,
there are two possible cases: match or mismatch. A match occurs when
either a_n = b_n = V_dd or a_n = b_n = 0, while a mismatch occurs when
a_n ≠ b_n. In the match case, there are no paths connecting OUTcmp to
ground, so there is no dynamic power, only leakage. In the mismatch
case, the dynamic power of the comparator is due to the OUTcmp dis-
charging current running through a number of pull-down paths and an
on-state NMOS transistor of the last-stage inverter of the timing chain
to the ground. The value of this current depends on the number of mis-
matched bits. The leakage power in this case is also negligible.
Based on these observations, the power models for a comparator of N_tag
bits are described by Eqs 5.48 – 5.55:

P^{cmp}_{total} = H_cache P^{cmp}_{match} + (1 − H_cache) P^{cmp}_{mismatch}
             = H_cache V_dd I^{cmp}_{leak} + (1 − H_cache) P^{cmp}_{dyn}   (5.48)
P^{cmp}_{dyn} = ∆V^{OUTcmp}_{swing} C_OUTcmp V_dd f_clk   (5.49)

C_OUTcmp = N_tag C^{nmos}_{drain} + C^{prech_pmos}_{source} + C_MLSA   (5.50)

I^{cmp}_{leak} = N^{an=1}_{match} I^{an=1}_{a_bit_leak} + N^{an=0}_{match} I^{an=0}_{a_bit_leak} + I^{inv}_{leak}   (5.51)
Here, H_cache is the cache hit ratio2, P^{cmp}_{match} and P^{cmp}_{mismatch} are the
comparator power values in the match case and mismatch case, respectively.
I^{cmp}_{leak} is the total leakage current of the comparator in the match case, and
P^{cmp}_{dyn} is the total dynamic power of the comparator in the mismatch case.
∆V^{OUTcmp}_{swing} is the voltage swing of the node OUTcmp when a mismatch
occurs, and C_OUTcmp is the output capacitance of the comparator. C^{nmos}_{drain},
C^{prech_pmos}_{source} and C_MLSA are the drain capacitance of an NMOS transistor,
the source capacitance of the PMOS precharge transistor and the input capacitance
of a MLSA, respectively. Using the methodology given in Section 5.3, the total
leakage current of a pair of pull-down paths in a match case with a_n = 1 and with
a_n = 0 is obtained by Eq. 5.53 and Eq. 5.55, respectively.
I^{an=1}_{a_bit_leak} = i2(M3) + i1(M1) + i1(M2)   (5.52)
                    = i2(M2) + i2(M4) + i3(M3) + i3(M4)   (5.53)

I^{an=0}_{a_bit_leak} = i2(M4) + i1(M2) + i1(M1)   (5.54)
                    = i2(M1) + i2(M3) + i3(M3) + i3(M4)   (5.55)
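Eq. 5.48 can be sketched as a hit-ratio-weighted mix of the two operating cases; the current and power values are invented placeholders:

```python
# Sketch of Eq. 5.48: expected comparator power as a hit-ratio-weighted mix
# of the match (leakage-only) and mismatch (dynamic) cases. Values invented.

def p_cmp_total(h_cache, vdd, i_cmp_leak, p_cmp_dyn):
    return h_cache * vdd * i_cmp_leak + (1 - h_cache) * p_cmp_dyn

# Even at the ~99.8% hit ratio cited for the I-cache, the rare mismatches can
# still dominate the total here, because a mismatch burns far more power than
# the match-case leakage:
p = p_cmp_total(h_cache=0.998, vdd=1.2, i_cmp_leak=10e-9, p_cmp_dyn=1e-4)
```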
5.5 Validation
In this section, the validation results of power models for partitioned and un-
partitioned data SRAM arrays, and for a partitioned SRAM-based tag array are
given. Based on the obtained validation results, some analyses and discussions
are presented and conclusions are drawn.
2 For L1 and L2 on-chip caches, the hit ratio is intentionally kept high by employing
many architecture-level cache management policies and techniques. The average hit ratio
is about 99.8% for the I-cache and 98.5% for the D-cache [16].
5.5.1 Validation Methodology
Below, a brief summary of the validation methodology is presented. The
methodology has been applied to validate the obtained power mod-
els against circuit-level simulations for several complete physically partitioned
and unpartitioned data SRAM arrays with different configurations, and also for
a partitioned SRAM-based tag array.
1. Select initial cache/memory-organization parameters, e.g. the size, the
output width, the access word length, the associativity, etc. Then, use
the CACTI 3.2 tool to generate the configuration parameters for all unparti-
tioned and partitioned data arrays, and for the partitioned tag array.
2. Create netlists of these arrays based on the obtained configuration param-
eters. Select for each array component a typical structure that is widely
used in the literature and the research community.
3. Properly size the netlists of these arrays in available commercial and
predictive CMOS processes. Perform simulations and static timing anal-
yses to ensure proper functionality of each array. In this validation
work, a commercial 0.13-µm process (V_dd = 1.2 V; normal V_T = V_TH0
≈ 0.25 V) and a Berkeley Predictive Technology Model (BPTM) 65-nm
process (V_dd = 1.1 V; V^{nmos}_T ≈ 0.42 V and |V^{pmos}_T| ≈ 0.36 V) have
been used.
4. Select the process-dependent parameters (i.e. threshold voltage VT , sup-
ply voltage VDD and process corners PV ), and other parameters (e.g.
temperature T , frequency F ) for setting up the simulation environment
for a circuit-level simulator (in this case, Hspice) to perform simulations
and analyses.
5. For each array, perform Component Characterization for each array com-
ponent by running a few simple Hspice DC simulations using the appro-
priate probes to obtain both the total leakage power value and the power
value of each leakage component, i.e. gate, subthreshold and substrate
leakage. At the same time, extract the nodal capacitances for each array
component. Then, tabulate those leakage power values and the nodal
capacitances into the pre-characterized leakage tables for use in the next
step.
6. For each component, use the proposed power models with the obtained
nodal capacitances to calculate the dynamic power dissipation value. The
total power dissipation of the component is the sum of the dynamic and
leakage power dissipation values.
7. For each array, calculate the total power dissipation value by summing up
all the components’ power dissipation values for each reading and writing
state.
8. For each array, perform several Hspice transient analyses to obtain the av-
erage per-cycle total power dissipation values for each reading and writ-
ing state.
9. For each array, compare the power value obtained in step 7 with the power
value obtained in step 8 for each reading and writing state to draw con-
clusions.
10. Repeat from step 4 for any changes in the value of PV , VT , VDD and T .
Repeat from step 1 for any changes in the cache/memory-organization
and configuration parameters.
The random nature of the input addresses to row/column decoders calls for
substantial modifications to the above-mentioned steps 6 and 8.
• Step 6a: For each address decoder, use the proposed power models with the
obtained nodal transition activity factors α0→1 to calculate the total dynamic
and leakage power dissipation values. The total power dissipation of the
decoder is the sum of these dynamic and leakage power dissipation values.
• Step 8a: For each address decoder, the total per-cycle power value is
estimated by running an Hspice transient analysis for a long trace consisting
of several hundred random read/write accesses. In this work, a trace of one
thousand random read/write accesses has been used.
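The activity factors needed in Step 6a can be extracted directly from an address trace. A hypothetical sketch, in which the trace length and the uniform-random address model are illustrative assumptions:

```python
import random

def alpha_0_to_1(bit_trace):
    """Fraction of cycles in which a node makes a 0 -> 1 transition."""
    rising = sum(1 for prev, cur in zip(bit_trace, bit_trace[1:])
                 if prev == 0 and cur == 1)
    return rising / (len(bit_trace) - 1)

random.seed(0)
n_addr_bits, trace_len = 7, 1000      # e.g. a 128-row array, 1000 accesses
trace = [[random.randint(0, 1) for _ in range(n_addr_bits)]
         for _ in range(trace_len)]

# Per-address-bit activity at the decoder inputs; for uniformly random
# addresses each value converges to 0.25 (P(0 followed by 1) = 0.5 * 0.5).
alphas = [alpha_0_to_1([addr[b] for addr in trace]) for b in range(n_addr_bits)]
```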
In this validation work, five data SRAM arrays with different configurations
and one SRAM-based tag array have been used. For ease of reference, each array
is assigned a configuration label as follows:
• Three 8-KB data SRAM arrays: an unpartitioned data array (referred to
as 8A), a partitioned data array with Ndwl= 4, Ndbl= 16 (referred to as
8B), and a partitioned data array with Ndwl = Ndbl= 16 (referred to as
8C).
• Two 2-KB data SRAM arrays: an unpartitioned data array (referred to as
2A) and a partitioned data array with Ndwl= 4, Ndbl= 8 (referred to as
2B).
• A 2-KB partitioned SRAM-based tag array with Ntwl= 1, Ntbl= 8.
All three 8-KB data SRAM arrays have been implemented in a commercial
0.13-µm CMOS process, while one 2-KB data SRAM array (i.e. 2A) has been
implemented in both a commercial 0.13-µm and a 65-nm BPTM bulk CMOS
process. Both the 2-KB partitioned data SRAM array (i.e. 2B) and the 2-KB
partitioned SRAM-based tag array have been implemented in a 65-nm BPTM
bulk CMOS process.
For conventional memory arrays in typical applications, the operational
temperatures T range from 40 °C to 110 °C; thus, a nominal middle temperature
point T = 70 °C has been selected. The typical process corner (PV = typical),
the normal supply voltage and the normal threshold voltage for both CMOS
processes have also been used in most Hspice simulations required for this
validation work. The access frequency of an array is defined by the static
timing analysis of each read/write cycle of that array. For example, array
2A has fclk = 400 MHz and 512 MHz when implemented in a 0.13-µm and a BPTM
65-nm process, respectively.
5.5.2 Validation of Power Models for Data SRAM Arrays
To prove the validity of the proposed power models for data SRAM arrays, the
above-presented validation methodology has been used for several partitioned
and unpartitioned arrays of 8 KBytes and 2 KBytes in size. The selection of
the array sizes is motivated by the fact that 8 KBytes is the practical size limit
of a memory bank, or in other words, it is the largest allowable size of a single
SRAM memory in order to maintain an acceptable access time without imple-
menting a memory-banking technique. To further reduce the total power dis-
sipation of an SRAM array, physical partitioning techniques are used for each
separate SRAM bank.
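As a concrete illustration of the partitioning parameters used below, the following sketch follows the CACTI convention in which Ndbl cuts each bitline and Ndwl cuts each wordline; the 128 x 128 dimensions match the 2-KB array 2B described above:

```python
def subarray_dims(rows, cols, ndwl, ndbl):
    """Return (rows, cols) of one subarray and the number of subarrays."""
    assert rows % ndbl == 0 and cols % ndwl == 0
    return rows // ndbl, cols // ndwl, ndwl * ndbl

# 2-KB data array 2B: 128 x 128 cells, Ndwl = 4, Ndbl = 8.
sub_rows, sub_cols, n_sub = subarray_dims(128, 128, ndwl=4, ndbl=8)
# Each access now activates only one short wordline and short bitlines, which
# is why partitioning reduces memory-cell dynamic power at the cost of extra
# internal control circuitry.
```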
Figure 5.11: Total power dissipation of 8-KB data arrays [blue/grey — 8A,
brown/black — 8B, yellow/white — 8C]. (Bar chart: power in mW, 0–7, per
component: memory cells, sense amps, write circuits, wordline drivers,
row/column decoders, internal control, and total array power.)
Fig. 5.11 shows the total power dissipation of the unpartitioned array 8A
(in blue/grey), the partitioned array 8B (in brown/black) and the partitioned
array 8C (in yellow/white) implemented in a commercial 0.13-µm process,
together with the power values of their components. The most basic observation
is that partitioning reduces the total power dissipation of the arrays,
primarily by reducing the power dissipation in the memory cells. Although
partitioning requires some extra power dissipation in internal control
circuits and introduces some delay overhead due to wakeup time, the total
power dissipation of a partitioned array is significantly lower than in the
unpartitioned case. For example, partitioning an 8-KB array with Ndbl = 16,
Ndwl = 4 (i.e. array 8B) reduces active power dissipation by 65% and leakage
power by 21%, resulting in a 60% total power reduction in comparison to the
unpartitioned array (8A) of the same size. In addition, Fig. 5.11 also shows
that array 8B, whose configuration was optimized for speed and power by
CACTI 3.2, has a higher total power dissipation than array 8C, whose
configuration is less optimized for speed.
Figure 5.12: Total power dissipation of 2-KB data arrays [blue/grey — 2A,
brown/black — 2B]. (Bar chart: power in µW, 0–3500, per component: memory
cells, sense amps, write circuits, wordline drivers, row/column decoders,
internal control, and total array power.)
Fig. 5.12 shows the total power dissipation of the 2-KB unpartitioned and
partitioned arrays implemented in a 65-nm BPTM process, as well as a power
breakdown into individual components. In this case, although partitioning
still reduces the total power dissipation of the partitioned array 2B (by
only 8.5%) in comparison to the unpartitioned array 2A, the partitioning
configuration suggested by CACTI 3.2 for array 2B (i.e. Ndbl = 8, Ndwl = 4)
is clearly non-optimal in terms of power reduction. Figs 5.11 and 5.12 also
clearly point out the main contributors to total array power dissipation:
the memory cells, the write circuits and the row/column decoders.
Fig. 5.13 shows accuracy values in estimating dynamic, leakage, and total
power dissipation for the unpartitioned and partitioned data arrays. For the
main contributor to the total power dissipation of the arrays, the memory
cells, the proposed models achieve very high accuracy in estimating dynamic
power (96%), leakage power (94% for unpartitioned and 98% for partitioned
arrays), and total power (97%). Although the models reach lower accuracy in
estimating dynamic power for wordline drivers (85%), and in estimating
dynamic and leakage power for write circuits (82%), they still offer very
high accuracy in estimating total power dissipation for all the data arrays
(97%).
Fig. 5.14 shows similar accuracy figures in estimating dynamic, leakage, and
total power dissipation for the 2-KB unpartitioned and partitioned arrays.
In this case, the accuracy achieved by the proposed models is high for the
memory cells, the sense amplifiers (SAs) and the wordline drivers. The worst
case in terms of accuracy is the write circuit (WRC), with values as low as
84%. A likely reason for this inaccuracy is the short-circuit power of the
WRC, which has not yet been captured in the proposed models.
The proportion of dynamic and leakage power in each array component is of
interest. Figs 5.15 and 5.16 show the proportion of dynamic and leakage power
(the sum of dynamic and leakage power amounts to 100%) for each component of
the 8A, 8B, 8C, 2A and 2B arrays, respectively. Physical partitioning reduces
the total power dissipation of an array by changing the proportion of dynamic
and leakage power, mostly in the memory cells and the wordline drivers. By
partitioning the unpartitioned array 8A with Ndbl = 16, Ndwl = 4 and with
Ndbl = Ndwl = 16, the "passive reading" dynamic power is rapidly reduced,
lowering the total dynamic power fraction for memory cells from 94% (in 8A)
to 68% (in 8B) and to 22% (in 8C). However, the memory
Figure 5.13: Accuracy in estimating: a) dynamic power, b) leakage power,
c) total power for 8-KB data arrays [blue/grey—8A, brown/black—8B,
yellow/white—8C]. (Bar charts of per-component estimation accuracy for memory
cells, sense amps, write circuits, wordline drivers and row/column decoders,
plus total array power.)
Figure 5.14: Accuracy in estimating: a) dynamic power, b) leakage power,
c) total power for 2-KB data arrays [blue/grey—2A, brown/black—2B]. (Bar
charts: accuracy, 80–100%, per component: memory cells, sense amps, write
circuits, wordline drivers and row/column decoders, plus total dynamic,
total leakage and total array power.)
Figure 5.15: The proportion of dynamic (in brown/black) and leakage (in
blue/grey) power in the 8A, 8B and 8C arrays. (Bar chart: percentage, 0–100%,
per component: memory cells, sense amps, write circuits, wordline drivers,
row/column decoders, and totals.)
Figure 5.16: The proportion of dynamic (in yellow) and leakage (in orange)
power in the 2A array, and the proportion of dynamic (in blue) and leakage
(in brown) power in the 2B array. (Bar chart: percentage, 0–100%, per
component.)
cell leakage power also increases rapidly, from 6% (in 8A) to 32% (in 8B)
and to 78% (in 8C). This trend makes memory-cell leakage more visible in the
partitioned arrays.
A similar trend is shown in Fig. 5.16. By partitioning the 2-KB array with
Ndbl = 8 and Ndwl = 4, the dynamic power, which is dominant in the
unpartitioned array 2A, trades places with the leakage power, which becomes
dominant in the partitioned array 2B. Since partitioning requires more
global/local wordline drivers, of which an increasing fraction is inactive,
the proportion of wordline driver leakage power increases significantly
(by 11%). As a result, after partitioning, the leakage power constitutes as
much as 45.5% of the array's total power dissipation.
In comparison to the unpartitioned array 2A, partitioning reduces active
power by 34.5%. However, it also increases leakage power by 75.8%! This
result clearly suggests that the partitioning parameters obtained from
CACTI 3.2 are not suitable for partitioning the given 2-KB array to reduce
total power dissipation in general, and leakage power in particular.
Furthermore, it can be concluded that the effect of partitioning on an
array's power depends strongly on the choice of process technology.
5.5.3 Validation of Power Models for SRAM-based
Tag Arrays
By directly applying the same modeling methodology that was used to obtain
power models for data SRAM arrays to SRAM-based tag arrays, power models for
the comparator are obtained. Together with the power models of the other
array components, the comparator's power models are used in this section to
provide power dissipation estimates for a partitioned 2-KB SRAM-based tag
array, for validation against the Hspice-simulated values.
Figure 5.17: Accuracy in estimating: a) dynamic power, b) leakage power for
a 2-KB SRAM-based tag array. (Bar charts: accuracy, 80–100%, per component:
memory cells, sense amps, write circuits, wordline drivers, row/column
decoders and comparator, plus total dynamic and total leakage power.)
Fig. 5.17 shows accuracy values in estimating dynamic (part a) and leakage
(part b) power dissipation for the tag array. As discussed in Section 5.4.3,
a tag array comparator dissipates power dynamically only in the mismatch
case; in the match case it is merely "leaking". Therefore, the accuracy
values shown in Fig. 5.17a are for the mismatch case (i.e. when a cache miss
occurs), and those shown in Fig. 5.17b are for the match case only.
Figure 5.18: Total power dissipation of a 2-KB partitioned SRAM-based tag
array (blue/grey) and a 2-KB partitioned data array (brown/black). (Bar
chart: power in mW, 0–3.0, per component: memory cells, sense amps, write
circuits, wordline drivers, row/column decoders, comparator, internal
control, and total array power.)
Fig. 5.18 shows the Hspice-simulated total power dissipation of a 2-KB
partitioned SRAM-based tag array (in blue/grey) and a 2-KB partitioned data
array (in brown/black) implemented in a 65-nm BPTM bulk CMOS process,
together with the power values of their components. The tag array is
physically partitioned with Ntwl = 1, Ntbl = 8, whereas the data array is
partitioned with Ndwl = 4, Ndbl = 8, forming a complete 2-KB SRAM-based
cache. The data array has 128 rows of 128 6T-SRAM cells each, while the tag
array has 128 rows of 22 6T-SRAM cells each. Although both arrays dissipate
nearly the same amount of total power, they have very different power
breakdowns. While the major contributors to total array power dissipation
in the data array are the memory cells and the wordline drivers, in the tag
array it is the write circuits. The main reasons for these differences are:
(i) the number of write circuits used in the tag array (22) is larger than
in the data array (8); (ii) the number of memory cells used in the tag array
(22 × 128) is smaller than in the data array (128 × 128). In addition, the
internal control circuits also dissipate a significant amount of power in
both arrays.
The proposed models achieve very high accuracy in estimating both dynamic
(97%) and leakage (96%) power for the memory cells, the SAs and the wordline
drivers. For the write circuits, the decoders and the comparator, however,
the obtained accuracy values are not impressively high. There is more than
one reason for this. First, since the modeling methodology that was used to
obtain power models for data SRAM arrays is directly applied to obtain the
power models for the comparator (a dynamic-style circuit), some features
have probably not yet been captured in the obtained power models. This
problem requires further research work. Second, the total power dissipation
of an SRAM-based tag array depends strongly on the cache hit ratio Hcache,
which normally is maintained as high as 99% [16]. Thus, for the comparator
the dynamic power is not an issue.
5.6 Thermal and Variability Issues
In the presence of temperature variations, dynamic power dissipation remains
unchanged while static power does not; the subthreshold component depends
exponentially on temperature. In the presence of supply voltage variations,
switching power depends quadratically on voltage, while both the subthreshold
and gate leakage components depend exponentially on it [3]. Therefore, in a
power modeling approach that targets very high accuracy, temperature- and
supply-voltage-aware leakage power modeling becomes unavoidable.
5.6.1 Modeling the Dependence of Leakage on Temperature
To avoid the complexity of an analytical approach, while maintaining a high
degree of accuracy and flexibility, a simulation-based approach to
temperature-aware leakage power estimation has been used. At any fixed
temperature, the proposed power models offer high accuracy (about 96% [17])
in estimating total power as well as dynamic and leakage power dissipation
for both partitioned and unpartitioned SRAM arrays. To preserve high accuracy
in the presence of temperature variations, the power models are extended to
a number of temperatures through a systematic extension of simulation points.
The first priority in assuring high accuracy in estimating total power
dissipation is to accurately capture the dependence of leakage on temperature
for the memory cells. It is possible to introduce temperature-dependent power
models for all other memory blocks too, but since memory-cell leakage power
is the main constituent of the total leakage power dissipation of an SRAM
array (approximately 78% [11]), it may suffice to only consider the
memory-cell model. In the context of temperature-dependent memory-cell power
modeling, this translates into a somewhat stricter accuracy requirement on
the memory-cell model, Acc_leak^mcells, which in turn defines the number of
temperature points at which the memory-cell model needs to be defined.
To model the dependence of leakage power on temperature for a 6T-SRAM cell,
we need to (i) select the temperature range of interest by specifying Tlow
and Thigh; (ii) obtain two leakage power values by running a short DC
simulation for the 6T-SRAM cell at the specified Tlow and Thigh; (iii)
calculate the number of simulation points using Eqs 5.56 - 5.57 with the
given allowable accuracy in estimating leakage power for memory cells,
Acc_leak^mcells; (iv) obtain leakage power values at the specified
temperature points by running a short DC simulation with a temperature sweep.
N_T.interval ≥ [I_sub(at T_high) − I_sub(at T_low)] / [2 · Acc_leak^mcells · I_sub(at T_low)]   (5.56)

N_simulation_point = N_T.interval + 1   (5.57)
The number of simulation points is defined by the leakage power accuracy
specified for the entire temperature range. An example is shown in Fig. 5.19,
where the given temperature range of interest is 50–100 °C,
Acc_leak^mcells = 10% and I_sub(at T_high)/I_sub(at T_low) = 5, resulting in
N_simulation_point = 22.
In Eq. 5.56, N_T.interval is an integer denoting the number of intervals
between simulation points in the selected temperature range.
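Eqs 5.56 - 5.57 can be evaluated mechanically. A sketch with illustrative currents: for a current ratio of exactly 5 and 10% accuracy the formula yields 21 points, while the 22 points reported in Fig. 5.19 follow from the actual simulated currents rather than the rounded ratio.

```python
import math

def n_simulation_points(i_sub_high, i_sub_low, acc_mcells_leak):
    """Eqs 5.56-5.57: temperature points needed for a leakage accuracy target."""
    # Eq 5.56, rounded up to an integer; the tiny offset guards against
    # floating-point noise at exact integer boundaries.
    n_interval = math.ceil((i_sub_high - i_sub_low)
                           / (2.0 * acc_mcells_leak * i_sub_low) - 1e-9)
    return n_interval + 1                             # Eq 5.57

# Illustrative currents with Isub(Thigh)/Isub(Tlow) = 5 and Acc = 10%.
points = n_simulation_points(5.0e-9, 1.0e-9, 0.10)
```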
Figure 5.19: Subthreshold leakage current as a function of temperature for a
6T-SRAM cell (commercial 130-nm). (Plot: Isub (A), 0–1.6×10⁻⁸, versus
temperature 30–110 °C, with Isub(at Tlow), Isub(at Thigh) and the 22
simulation points marked.)
For a typical SRAM, the leakage power of the other memory components,
P_leak^others, constitutes approximately one fifth of the total array leakage
power, P_leak [11]. As mentioned before, temperature modeling for the other
components can be omitted to simplify the temperature-modeling approach. In
this case, although the accuracy requirement on the memory cells becomes
stricter than in the original case, this causes neither fundamental problems
nor significant changes in our power models. Eqs 5.58 - 5.60 define the
accuracy requirement on the memory cells, given the accuracy in estimating
total leakage power, Acc_leak, the ratio Ratio_others/total^leak between
P_leak^others and P_leak, and the ratio Ratio_mcells/total^leak between the
leakage power of the memory cells, P_leak^mcells, and P_leak.
Acc_leak^mcells = (Acc_leak − Ratio_others/total^leak) / Ratio_mcells/total^leak   (5.58)

Ratio_mcells/total^leak = P_leak^mcells / P_leak   (5.59)

Ratio_others/total^leak = P_leak^others / P_leak   (5.60)
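A direct, hypothetical transcription of Eqs 5.58 - 5.60 follows; the leakage amounts are illustrative, chosen to match the roughly 78% / 22% memory-cell / other-component split cited in the text.

```python
def acc_mcells_leak(acc_leak, p_leak_mcells, p_leak_others):
    """Eq 5.58: accuracy required of the memory-cell leakage model."""
    p_leak = p_leak_mcells + p_leak_others            # total array leakage
    ratio_mcells = p_leak_mcells / p_leak             # Eq 5.59
    ratio_others = p_leak_others / p_leak             # Eq 5.60
    return (acc_leak - ratio_others) / ratio_mcells   # Eq 5.58

# Illustrative: 96% target accuracy on total leakage, 78/22 leakage split.
req = acc_mcells_leak(acc_leak=0.96, p_leak_mcells=78.0, p_leak_others=22.0)
```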
5.6.2 Modeling Leakage with Variation in Supply Voltage
A common method to efficiently reduce total power dissipation is to reduce
the supply voltage, since switching power has a quadratic dependence on Vdd
while leakage power has an exponential one. Several circuit-level
leakage-reduction techniques have been utilized in architecture-level
leakage-control schemes: either the power to cache lines can be cut off
(i.e. "gated-Vdd" schemes, in which leakage basically is eliminated
entirely), or it can be set to an intermediate voltage level (i.e. "drowsy"
schemes) that guarantees the memory data is retained. Drowsy schemes have
received considerable attention; it was shown [18] that total cache leakage
energy is reduced by an average of 76% at a wakeup penalty, for a drowsy
cache line, of no more than one cycle. Drowsy caches can be implemented
using simple control circuits that assign different voltage levels, called
tranquility levels V_tlevel^drowsy, at different priority levels, based on
information from the replacement policy used [19].
To model the leakage power for “drowsy” memories, the dominating leak-
age mechanisms need to be modeled only for the circuits that exhibit static
power in idle mode. Only the SRAM cells need to be driven by the intermedi-
ate voltage level; all other circuits can be power gated completely. Thus, only a
leakage model for the SRAM cell’s dependence on the power supply’s tranquil-
ity level is required.
From Eqs 3.2 - 3.5 and the BSIM4 threshold voltage equations [1], it is clear
that the subthreshold leakage current's dependence on supply voltage is
exponential in Vdd (on the order of e^Vdd), which, in comparison to its
dependence on temperature, is very straightforward. Based on this
observation, a physically based analytical approach for modeling the leakage
dependence of memory cells on supply voltage is proposed.
For the sake of simplicity, linearly distributed voltages are assumed for the
N_tlevel^drowsy tranquility levels between the lowest possible operating
voltage (V_min.tlevel^drowsy ≈ VT + 200 mV, representing deep sleep mode)
and the full supply voltage. The relation between I_leak^mcell and
V_tlevel^drowsy is established from the gate and subthreshold leakage
currents obtained by running a short DC simulation for a 6T-SRAM cell with
the supply voltage varying from V_min.tlevel^drowsy to the full supply
voltage. Then, the total leakage power of the memory cells is expressed as a
function of Vdd:
P_leak^mcells(Vdd) = N_mcells · Vdd · I_leak^mcell(Vdd)   (5.61)

I_leak^mcell(Vdd) = I_gleak^mcell(Vdd) + I_subleak^mcell(Vdd)   (5.62)
Figure 5.20: Gate and subthreshold leakage currents as functions of Vdd for
a 6T-SRAM cell (BPTM 65-nm [20]). (Plot: current (A), 0–4×10⁻⁸, versus Vdd
0.6–0.9 V; fitted curves: Isleak = 1.45×10⁻¹⁰ + 0.99·e^(−18.7+0.92·Vdd),
Igleak = 2.14×10⁻¹⁰·e^(5.23·Vdd), and their sum Itotal.)
Both the gate and subthreshold leakage power of a memory cell depend
exponentially on Vdd. However, since gate leakage is very sensitive to
changes in the transistor gate voltage, it depends strongly on Vdd;
subthreshold leakage, on the other hand, is less sensitive to changes in
Vdd, which is reflected in a weaker exponential function. Fig. 5.20 shows
the dependence of gate and subthreshold leakage on Vdd for a 65-nm BPTM
6T-SRAM cell with maximum Vdd = 0.9 V, V_min.tlevel^drowsy = 0.6 V,
N_tlevel^drowsy = 8, and T = 70 °C [20].
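Under these assumptions, Eqs 5.61 - 5.62 combined with the fitted curves of Fig. 5.20 give a complete drowsy leakage model. A sketch follows; the exponential fits are read off the figure and should be treated as approximate:

```python
import math

def i_subleak(vdd):
    """Fitted subthreshold leakage of one 6T cell, in A (Fig. 5.20)."""
    return 1.45e-10 + 0.99 * math.exp(-18.7 + 0.92 * vdd)

def i_gleak(vdd):
    """Fitted gate leakage of one 6T cell, in A (Fig. 5.20)."""
    return 2.14e-10 * math.exp(5.23 * vdd)

def p_mcells_leak(n_mcells, vdd):
    """Eqs 5.61-5.62: total memory-cell leakage power at supply vdd."""
    return n_mcells * vdd * (i_gleak(vdd) + i_subleak(vdd))

# Eight linearly distributed tranquility levels between deep sleep (0.6 V)
# and the full supply (0.9 V), as assumed in the text.
v_min, v_max, n_levels = 0.6, 0.9, 8
tlevels = [v_min + i * (v_max - v_min) / (n_levels - 1) for i in range(n_levels)]
powers = [p_mcells_leak(128 * 128, v) for v in tlevels]   # e.g. a 2-KB array
```

The stronger exponent in the gate-leakage fit reproduces the observation above that gate leakage is the more supply-sensitive of the two components.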
Figure 5.21: The subthreshold leakage current's dependence on temperature
for a 6T-SRAM cell (commercial 130-nm with process corners SS, TT, FF).
(Plot: Isub (A), 0–4.5×10⁻⁸, versus temperature 30–110 °C, one curve per
corner.)
5.6.3 Modeling the Dependence of Leakage on Process
Corner
The notion of process corners represents a straightforward way to capture
manufacturing-induced variations in device characteristics in simulation.
The process corner TT denotes the typical case for both NMOS and PMOS
devices. This is the corner on which all simulations routinely are based,
and so are all power models thus far in this dissertation. The corner SS
(Slow NMOS and PMOS), on the other hand, assumes the slowest possible
devices, leading to the lowest leakage, whereas FF (Fast NMOS and PMOS)
conversely yields the highest leakage.
During design exploration of SRAM arrays, evaluation of process corners can
prove useful for understanding how device variability impacts the resulting
leakage power. Fig. 5.21 shows the subthreshold leakage of a 130-nm 6T-SRAM
cell for the three different corners as a function of temperature. As
expected, the different process corners give different memory-cell leakage
power; the magnitude varies by as much as 10×.
As shown in Fig. 5.21, the memory-cell leakage obtained for all process
corners has a similar dependence on temperature, and likewise a similar
dependence on supply voltage. Not surprisingly, the type of process corner
is just another input dimension, next to e.g. temperature, that can be added
to the leakage power tables. Since the proposed approach is fully
parameterizable with respect to memory size (one integer defines the row
count, while another defines the column count), only one instance of the
memory-cell power models is used. Therefore, the complexity of using the
proposed method does not increase with the added complexity of the core
memory-cell model.
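The table extension described here can be sketched as adding the corner as one more key dimension. The entries below are illustrative placeholders, not simulated values; the FF/SS spread is chosen to echo the roughly 10× range visible in Fig. 5.21.

```python
# (process corner, temperature in C) -> 6T-cell leakage current in A.
# Illustrative placeholder values, not characterized data.
LEAK_TABLE = {
    ("SS", 70): 0.4e-8, ("TT", 70): 1.0e-8, ("FF", 70): 4.0e-8,
    ("SS", 110): 1.2e-8, ("TT", 110): 3.0e-8, ("FF", 110): 12.0e-8,
}

def mcells_leak_power(corner, temp_c, vdd, rows, cols):
    """One parameterizable memory-cell model, scaled by row/column counts."""
    return rows * cols * vdd * LEAK_TABLE[(corner, temp_c)]

p_ss = mcells_leak_power("SS", 70, 1.2, 128, 128)
p_tt = mcells_leak_power("TT", 70, 1.2, 128, 128)
p_ff = mcells_leak_power("FF", 70, 1.2, 128, 128)
```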
Bibliography
[1] Univ. California Berkeley Device Group, BSIM4.2.1 MOSFET Model: User’s
Manual, Dept. of EECS, Univ. of California, Berkeley, CA 94720, USA, 2002.
[2] Y. Zhang et al., CS 2003-05: HotLeakage : A Temperature-Aware Model of Sub-
threshold and Gate Leakage for Architects, Dept. of CS, Univ. of Virginia, USA,
2003.
[3] W. Liao et al., “Temperature and Supply Voltage Aware Performance and Power
Modeling at Microarchitecture Level,” IEEE Trans. on CAD of ICS, vol. 24, no. 7,
pp. 1042–53, July 2005.
[4] D. Tarjan et al., HPL 2006-86: CACTI4.0, HP, 2006.
[5] M. Mamidipaka et al., CECS 04-28: eCACTI: An Enhanced Power Estimation
Model for On-chip Caches, CECS, Univ. of California, Irvine, USA, 2004.
[6] International Technology Roadmap for Semiconductors, http://public.itrs.net,
ITRS, 2006.
[7] M. Q. Do, P. Larsson-Edefors, and L. Bengtsson, “Table-based Total Power Con-
sumption Estimation of Memory Arrays for Architects,” in Proceedings of Inter-
national Workshop on Power and Timing Modeling, Optimization and Simulation
(PATMOS’04), LNCS 3254, Sept. 2004, pp. 869–878.
[8] M. Q. Do, Mindaugas Draždžiulis, and P. Larsson-Edefors, Technical Report
2007-07: Current Probing Methodology for Static Power Extraction in Sub-90nm
CMOS Circuits, Department of Computer Science and Engineering, Chalmers
University of Technology, Göteborg, Sweden, 2007.
[9] T. Wada et al., “An Analytical Access Time Model for On-Chip Cache Memories,”
JSSC, vol. 27, no. 8, pp. 1147–56, Aug. 1992.
[10] A. Chandrakasan et al., Design of High-Performance Microprocessor Circuits,
IEEE Press, 2001.
[11] M. Q. Do, Mindaugas Draždžiulis, P. Larsson-Edefors, and L. Bengtsson, “Pa-
rameterizable Architecture-level SRAM Power Model Using Circuit-simulation
Backend for Leakage Calibration,” in Proceedings of International Symposium
on Quality Electronic Design (ISQED), March 2006, pp. 557–563.
[12] S.J.E. Wilton and N.P. Jouppi, WRL Research Report 93/5: An Enhanced Access
and Cycle Time Model for On-chip Caches, Western Research Laboratory, 1994.
[13] A. P. Chandrakasan and R.W. Brodersen, “Minimizing Power Consumption in
Digital CMOS Circuits,” Proceedings of the IEEE, vol. 83, no. 4, pp. 498–523,
April 1995.
[14] A. Karandikar et al., “Low Power SRAM Design Using Hierarchical Divided Bit-
line Approach,” in ICCD 1998, Oct. 1998, pp. 82–8.
[15] K. Pagiamtzis and A. Sheikholeslami, “Content-Addressable Memory (CAM) Cir-
cuits and Architectures: A Tutorial and Survey,” IEEE Journal of Solid-State Cir-
cuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[16] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Ap-
proach, Morgan Kaufmann, fourth edition, 2006.
[17] M. Q. Do, Mindaugas Draždžiulis, P. Larsson-Edefors, and L. Bengtsson,
“Leakage-Conscious Architecture-Level Power Estimation for Partitioned and
Power-Gated SRAM Arrays,” in Proceedings of International Symposium on Qual-
ity Electronic Design (ISQED), March 2007.
[18] K. Flautner et al., “Drowsy Caches: Simple Techniques for Reducing Leakage
Power,” in ISCA 2002, May 2002, pp. 148–57.
[19] N. Mohyuddin et al., “Controlling Leakage Power with the Replacement
Policy in Slumberous Caches,” in CF 2005, May 2005, pp. 161–70.
[20] Y. Cao et al., “New paradigm of predictive MOSFET and interconnect modeling
for early circuit design,” in CICC 2000, 2000, pp. 201–4.
6 Conclusion and Future Work
In this chapter, conclusions on the presented work are given and future work on
both the power modeling part and on the implementation of the proposed power
models in a high-level power-performance simulator is discussed.
6.1 Conclusion
Following Moore's Law, the number of transistors integratable on a chip
doubles every two years, and leakage power increases exponentially along
with it. This increase in transistor count leads to one of the most
difficult-to-solve problems for the semiconductor industry: leakage power
dissipation. Although subthreshold leakage still remains the main
contributor to total leakage, other mechanisms such as gate oxide tunneling
and junction (BTBT) leakage are of increasing significance. When total
leakage power approaches about 50% of total power, further supply voltage
scaling for normal MOS transistors will not make sense, because the
accompanying scaling of the threshold voltage gives rise to even more
leakage power. This puts serious demands on low-power design, leakage
control and reduction techniques, and eventually on leakage power estimation
tools. Therefore, accurate leakage power estimation is needed to allow
designers to make good design trade-offs at higher, architectural design
levels.
Since all leakage mechanisms are closely related to the physical behavior of
MOS transistors, circuit-level simulators are needed in order to maintain
high accuracy in estimating leakage power dissipation. However, this high
accuracy comes at an extremely high cost in the form of computational
complexity, since these circuit-level simulators are built on very complex,
technology-dependent and detailed analytical power models, e.g. BSIM3 or
BSIM4. Obviously, circuit-level simulation alone is not a viable solution.
On the other hand, as shown in Section 1.3 of this dissertation, neither can
simplified analytical leakage power models resolve the conflicting
requirements placed on leakage power estimation: high accuracy, flexibility
and simplicity. This is the area in which our research work intends to
contribute.
This dissertation presents a modular, hybrid power modeling methodology
capable of capturing accurately both dynamic and leakage power mechanisms
for SRAM-based memory structures like on-chip caches and SRAM arrays.
The methodology successfully combines the most valuable advantage of circuit-
level power estimation – high accuracy – with the flexibility of higher-level
power estimation while allowing for short component characterization and es-
timation time. The methodology offers high-level parameterizable, but still ac-
curate power dissipation estimation models that consist of analytical equations
for dynamic power and pre-characterized leakage power values stored in tables.
Through verification for a number of SRAM arrays and on-chip caches with
different configurations implemented in 0.13-µm and 65-nm CMOS processes,
the proposed power models show a high accuracy in estimating both dynamic
and static power for all the SRAM array and cache components.
In order to correctly capture the total leakage currents of sub-90nm logic
circuits when circuit simulators such as Hspice are employed, a methodology
for probing circuits for static current measurements in CMOS circuits during
simulation has been proposed. In the power modeling validation part, the
proposed probing methodology has been used successfully to obtain accurate
and distinguishable static power constituents (i.e. gate, subthreshold and
total leakage power) for several unpartitioned and physically partitioned
data SRAM arrays and an SRAM-based tag array implemented in a BPTM 65-nm
process.
In addition, a modeling methodology to capture the dependence of leakage
power on temperature variation, on supply-voltage scaling, and on the selection
of process corners has also been presented. This methodology provides an es-
sential extension to the proposed power models.
To the best of our knowledge, the proposed power modeling methodology
and power models are the first to offer high-level, parameterizable, relatively
simple, and highly accurate cache power estimation models accounting for both
dynamic and static power consumption.
6.2 Future Work
The following is a list of major tasks that are subject to future work:
1. As mentioned in Section 5.5.3, the obtained power models for a com-
parator of an SRAM-based tag array do not yet correctly capture all power
dissipation mechanisms in the comparator, and therefore still suffer from
low accuracy (around 80%) in estimating leakage power. This problem
requires additional research to solve.
2. Thermal management and hot-spot identification are emerging issues due
to technology scaling. By knowing the leakage power density within a
chip, it is possible to obtain a thermal map for it. Coupling power
modeling to thermal mapping is thus a good topic for future research.
3. There is a need to implement our proposed power models in an existing
power simulator, e.g. CACTI, to improve its power dissipation estimates.
This is also a good topic for future research.
4. The proposed power modeling methodology is modular and applicable
to any type of component with a regular structure that satisfies the
following two main requirements: (i) the number of internal hardware
block/cell instances is finite; (ii) the netlists of typical components are
provided. Potential candidate components for future work are therefore
Content-Addressable Memories (CAMs) and clocking networks.
Part IV
Appendix
DSP-PP – A Power Estimation and
Performance Analysis Tool for
Parallel DSP Architectures
This chapter is devoted to describing the work done in designing and imple-
menting an architecture-level, cycle-accurate power-performance simulator
for parallel DSP architectures (DSP-PP). Section A.1 gives some background
information on the special characteristics of DSP architectures. Section A.2
then describes in detail the design of the DSP-PP simulator and its usage in
estimating the performance and power consumption of parallel DSP architectures.
A.1 Characteristics of DSP Architectures
Compared to microprocessors, DSP architectures have the following special
characteristics:
1. A fixed-point DSP usually has one or several MACs, each of which con-
sists of a single-cycle (or pipelined) multiplier, a fixed-point ALU
operating on double-wordlength operands, double-wordlength accumula-
tors, shifters, and registers. A floating-point DSP usually has one or several
floating-point MACs used together with one or several fixed-point or
floating-point ALUs. DSPs usually provide good support for saturation
arithmetic, rounding, and shifting.
2. In order to save cost and reduce energy consumption, DSPs tend to use
the shortest data word and to lower the clock frequency to the minimum
value that still provides adequate accuracy in the target applications. The
data words of most fixed-point DSPs are 16 or 32 bits wide.
3. A combination of several special-purpose registers (e.g. accumulators)
and general-purpose register files is used. The number of registers is
usually smaller than in microprocessors.
4. Separate small multi-ported data memories (e.g. two data memories for
X and Y operands) and program memories are used. Data memories are
usually built in-core with sizes up to 64 KB. The program memory can be
built either on-core or off-core, and is typically larger than the data
memories. A small single-level instruction cache can be used to store the
program. Some DSPs also use a unified data-program memory structure,
but the Harvard memory structure is most common.
5. Multiple buses are used to communicate between the datapaths and the
memory subsystem (on-chip or off-chip), between the DSP core and
peripherals, etc. DSPs usually have specialized interfaces, e.g. analogue-
to-digital and digital-to-analogue converters.
A.1. CHARACTERISTICS OF DSP ARCHITECTURES 141
6. A simple instruction pipeline is used. Since a DSP assumes that data
dependencies are known and the data flow is predictable (which is gen-
erally possible for DSP applications), there is no out-of-order issue or
execution of instructions.
7. A simplified instruction set is used, consisting mostly of simple instruc-
tions for datapath functions (about 80% of the total instruction count
consists of the most frequently used DSP instructions; the other 20% are
multi-cycle complex instructions). DSPs offer instruction-level functional
acceleration (i.e. the ability to accelerate and merge the most frequently
used DSP instructions into subroutines); DSP instructions are therefore
very efficient and DSP code size is small.
8. Register-register (register direct), memory-memory (memory direct), cir-
cular, immediate, register indirect, and register indirect with post-incre-
ment addressing modes are used. These addressing modes require complex
data-memory addressing circuits.
9. A hardwired instruction decoder is used to generate control signals for
the datapaths.
10. Most DSP applications and algorithms are implemented in assembly
language, and sometimes in C.
11. Parallel DSPs usually achieve a high level of parallelism by combining
multiple cores of the same architecture in one system (e.g. the BOPS
ManArray DSP architecture). Communication and data transfer between
cores are provided by switching fabrics and system buses that are capable
of interconnecting and organizing a set of cores into ring, mesh, torus,
hypercube, and other topologies. Local parallelism is achieved mainly by
using multiple datapaths and multiple resources (e.g. the VLIW DSP
architecture) and by instruction pipelining.
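Two of the features listed above — the fixed-point MAC with a double-wordlength saturating accumulator (items 1 and 2) and circular addressing (item 8) — can be sketched in C++ as follows. This is an illustrative sketch; the function names are ours and do not correspond to any particular DSP.

```cpp
#include <cstdint>

// 16-bit fixed-point MAC: multiply two 16-bit operands, accumulate
// into a 32-bit (double-wordlength) accumulator, and saturate on
// overflow instead of wrapping around.
int32_t mac_saturate(int32_t acc, int16_t a, int16_t b) {
    int64_t wide = static_cast<int64_t>(acc) +
                   static_cast<int64_t>(a) * static_cast<int64_t>(b);
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return static_cast<int32_t>(wide);
}

// Register-indirect post-increment with circular wrap-around, as used
// for filter delay lines in circular addressing mode.
int circular_next(int index, int step, int buffer_len) {
    return (index + step) % buffer_len;
}
```

Saturation is what distinguishes DSP arithmetic from plain integer arithmetic: a filter output that overflows clips to the largest representable value rather than changing sign.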
A.2 DSP-PP
A.2.1 Features of the DSP-PP
The DSP-PP is a cycle-accurate performance simulator and power consumption
estimator for parallel DSP architectures. The DSP-PP has been designed us-
ing an object-oriented approach and written in C++ with the SystemC library
to provide a high level of abstraction and encapsulation, as well as flexibility
and extendibility of the simulation program. The block diagram of the DSP-PP
simulator/estimator is shown in Fig. A.1.
The first version of the simulator (DSP-PP version 1.0) was implemented
using analytical power models developed mainly on the basis of the Wattch
power models, with added components for leakage power dissipation estima-
tion [1]. However, these modified Wattch power models offer low accuracy
values (i.e. 15%-70%) in estimating power dissipation compared to the power
values obtained using a circuit-level power estimation tool such as HSPICE [2].
These accuracy values do not satisfy the architecture-level requirement on
accuracy, and the modified analytical Wattch power models can therefore not
be used in our DSP-PP simulator. This problem triggered the research ideas
leading to the introduction of the WTTPC approach and the table-based power
dissipation models that are implemented in the current version of the DSP-PP
simulator. The implemented simulator (version 2.0) is described in more detail
in Section A.2.2 below.
The DSP-PP consists of two components: the Cycle-level Performance Sim-
ulator (CPS) and the Power-Dissipation Estimator (PDE).
Cycle-level Performance Simulator (CPS):
The CPS is an execution-driven cycle-accurate performance simulator. The
main functions of CPS are as follows:
[Figure A.1 shows the block diagram of the DSP-PP: a Cycle-by-cycle Performance Simulator feeding a trace to a Power Dissipation Estimator. Inputs are a program executable or compiled benchmark, the hardware configuration, process technology parameters, and the power models (tables); outputs are a performance estimate and a power consumption estimate.]
Figure A.1: Block Diagram of the DSP Power Performance Simulator
1. Accepts as input an executable program, obtained by compiling the input
benchmarks, as well as the PE/DSP configuration (here, PE denotes a
processor element).
2. Simulates, cycle-by-cycle, instruction execution and dataflows between
PE components as well as between parallel DSP architecture components.
3. Generates output performance statistics (i.e. program cycle counts) and
cycle-by-cycle traces.
Using object-oriented programming techniques, all components are mod-
eled as objects. Each object accepts a certain type of input data, performs cer-
tain functions, and generates defined outputs. Moreover, each object also has
a power consumption model and a hardware access count that can be sent di-
rectly to the PDE to create the power consumption estimate. The communi-
cation between objects, and the order of that communication, is handled by an
event scheduler, the Simulator Engine, which is the core module of the DSP-PP
simulator.
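The object model described above can be illustrated with a minimal sketch; the struct and function names are assumed here and are not taken from the DSP-PP source. Each component object keeps a hardware access count that the simulator engine updates and the PDE can read out per cycle.

```cpp
#include <vector>

// A simulated hardware component: the Simulator Engine calls access()
// whenever the unit is used, and the PDE later reads the count.
struct Component {
    const char* name;
    unsigned long accesses = 0;
    void access() { ++accesses; }   // invoked by the event scheduler
};

// Per-cycle roll-up that a PDE-like module could perform over all units.
unsigned long total_accesses(const std::vector<Component>& units) {
    unsigned long sum = 0;
    for (const Component& u : units) sum += u.accesses;
    return sum;
}
```

In the real simulator each object would additionally carry a power model; the access counts are the interface through which performance simulation drives power estimation.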
Power Dissipation Estimator (PDE):
The PDE consists of power consumption models for DSP components and a
total-power-estimation-engine module used to calculate the overall power dissi-
pation of the entire parallel DSP architecture in a cycle-by-cycle manner. These
power models include WTTPC tables of power values for memory arrays and
similar types of components, and parameter sets for other types of DSP com-
ponents, such as arithmetic-logic circuits. The main functions of the PDE are
as follows:
1. Accepts as input cycle-by-cycle traces from the CPS for the different
hardware components involved in the parallel DSP architecture, as well
as the PE/DSP configuration and the configuration of the entire parallel
DSP architecture.
2. Generates power estimation values in a cycle-by-cycle manner for the
given configuration.
In order to reduce the number of WTTPC tables created for each compo-
nent, the PDE is designed, similarly to what was done in [8], so that it can inter-
polate (using curve-fitting interpolation functions) between component charac-
terization points covering the entire possible design range of that component.
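Assuming linear interpolation as one simple instance of such a curve-fitting function (the actual PDE may use other fitting functions), the lookup between characterization points could be sketched as follows; the function name and the use of a sorted map are our assumptions.

```cpp
#include <iterator>
#include <map>

// Interpolate a power value for design point x from a non-empty table
// of (design point -> characterized power) pairs, e.g. memory sizes.
// Points outside the characterized range are clamped to the nearest entry.
double interpolate_power(const std::map<double, double>& points, double x) {
    auto hi = points.lower_bound(x);
    if (hi == points.end())   return std::prev(hi)->second; // above range
    if (hi->first == x)       return hi->second;            // exact point
    if (hi == points.begin()) return hi->second;            // below range
    auto lo = std::prev(hi);
    double t = (x - lo->first) / (hi->first - lo->first);
    return lo->second + t * (hi->second - lo->second);
}
```

With only a few characterized sizes per component, any intermediate configuration can thus be estimated without a new table, which is exactly the table-count reduction the paragraph above describes.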
A.2.2 Description of the DSP-PP Simulator (Version 2.0)
The Cycle-level Performance Simulator has been fully implemented for the ex-
tended ManArray DSP architecture. The Power Dissipation Estimator is par-
tially implemented, and power consumption models for all DSP components are
still part of our ongoing research. The implementation of the DSP-PP was
done in a Master's thesis project by Firas Milh [3]. This section gives a brief
description of the implemented DSP-PP Simulator version 2.0.
The program code for the simulator is written in Visual C++ 6.0 using the
SystemC library. The code is divided into two projects: one contains the simu-
lator and the other contains the graphical user interface. The simulator is divided
into a collection of files, where every implemented unit has two files associated
with it: a declaration file (*.h) and an implementation file (*.cpp). There are
also files associated with the main program, the shared memory, and the differ-
ent classes used for communication between simulator units. The GUI project
is a Microsoft Windows project based on dialog boxes.
System Overview
In order to fulfill the design features of the DSP-PP defined in Section A.2.1
above, several changes were made to the ManArray model and its assembly
language. Among the most important modifications is the number of units in
each PE. Instead of the fixed five units, additional ALUs, MAUs, and DSUs are
supported. For each of these three unit types there can be at most 10 units,
which, together with the single LU and SU, sums to a total of 32 execution
units per PE in the widest configuration. This change requires additional VLIW
memory, memory ports, and a few additional registers to keep track of the status
of each unit. Another generalization is the ability to have an arbitrary number of
PEs connected to the SP. This flexibility is limited only by the available sys-
tem resources of the host machine. The Cluster Switch is resized dynamically
to accommodate the number of PEs. A limitation of the simulator is the lack
of support for some of the instructions in the ManArray instruction set and for
the DMA capability [3].
These modifications turn the original ManArray architecture into a very
flexible parallel DSP architecture, capable of changing the number of PEs at-
tached to each core as well as the number of execution units inside each SP/PE
(up to 32 units), and of reconfiguring the cluster switches to handle connections
between any needed number of PEs and SPs. This extended version of the Man-
Array architecture can therefore serve as a base for other types of DSPs. For
example, with only a single active MAC (i.e. a MAU) and with the iVLIW
pipeline organization turned off, the architecture resembles a "simple" DSP,
while a general-purpose VLIW DSP can be modeled by activating all five exe-
cution units and turning the VLIW pipeline organization on.
This extended ManArray architecture allows users to elaborate their ideas
using different numbers of functional units, different sizes of register files, differ-
ent sizes of memories, etc. in their exploration of the DSP architecture design
space.
Cycle-Accurate Modeling
The simulator executes the given code cycle by cycle, registering important
events and bit transitions within the architecture model. Counter variables built
into the simulator keep track of all important accesses, unit activities, and bit
flips. At every clock cycle, every type of access is registered both by the counter
variables inside the simulator and in files, with one file per Processing Element.
The files have the format of comma-separated lists, with one row for each cycle
and one entry per counter variable in each row. These files can be read by
Microsoft Excel or other spreadsheet software, which makes calculation and
manipulation of the statistics rather straightforward. There is also a small
accompanying program that reads the count variables directly from the simula-
tor through shared memory.
Active Units
Every unit inside the SP and PE in the simulator model has an activity counter
which keeps track of the number of clock cycles that the unit is active. Every
memory read is registered in an access-count structure with three elements for
every type of read. The first element of the count structure holds the number
of actual independent reads, the second holds the number of zeroes read, and
the third holds the number of ones read. These counters are separate for each SP
and PE. Every memory write is also registered in an access-count structure with
five elements for every type of write. The first element of the count structure
holds the number of actual independent writes, the second holds the number
of writes where a zero is written to a bit containing a zero, the third holds the
number of writes where a zero is written to a bit containing a one, the fourth
holds the number of writes where a one is written to a bit containing a
zero, and the fifth holds the number of writes where a one is written to a
bit containing a one. These counters are also separate for each SP and PE.
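The two count structures can be sketched as follows; the struct and field names are assumed, and the sketch uses 16-bit words for illustration. The per-bit write classification is what allows the PDE to later weight each transition type with its own energy cost.

```cpp
#include <cstdint>

// Per-read counters: total reads plus zero/one bit tallies.
struct ReadCount {
    unsigned long reads = 0, zeros = 0, ones = 0;
};

// Per-write counters: total writes plus the four old/new bit cases.
struct WriteCount {
    unsigned long writes = 0;
    unsigned long w0_over_0 = 0, w0_over_1 = 0;  // zero written over 0 / 1
    unsigned long w1_over_0 = 0, w1_over_1 = 0;  // one written over 0 / 1
};

// Register one 16-bit read: tally the zero and one bits of the word.
void register_read(ReadCount& rc, uint16_t word) {
    ++rc.reads;
    for (int i = 0; i < 16; ++i) {
        if ((word >> i) & 1) ++rc.ones; else ++rc.zeros;
    }
}

// Register one 16-bit write: classify every bit by old and new value.
void register_write(WriteCount& wc, uint16_t old_w, uint16_t new_w) {
    ++wc.writes;
    for (int i = 0; i < 16; ++i) {
        bool o = (old_w >> i) & 1, n = (new_w >> i) & 1;
        if (!n && !o)     ++wc.w0_over_0;
        else if (!n && o) ++wc.w0_over_1;
        else if (n && !o) ++wc.w1_over_0;
        else              ++wc.w1_over_1;
    }
}
```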
Implementation Overview
At the topmost level, the Sequence Processor (SP), an array of Processing Ele-
ments (PEs), and the cluster switch are declared and connected with the appro-
priate signals. Each of the units at this level has a main clock signal for synchro-
nization purposes. Fig. A.2 and Fig. A.3 show the interconnection of compo-
nents inside an SP and a PE of the simulator, respectively.
[Figure A.2 shows the SP components: the instruction memory, instruction registers IR1 and IR2, VIMs (shared memory), control-flow decode and execute units, a branch unit, and an EP loop unit.]
Figure A.2: Interconnection of components inside an SP of the extended ManArray architecture [3]
The SP is connected to each of the PEs with a set of instruction-carrying
signals. These signals are used by the SP to dispatch instructions directly to
the corresponding port of each of the PE units. There are two sets of signals
carrying instructions from the SP to each unit of each PE. Each set is associ-
ated with one of the two pipeline modes: Normal Pipeline (NP) and Extended
Pipeline (EP). The instruction port set associated with the NP is a single set and
is dispatched from the IR1 unit of the SP to the decode stage of each of the
execution units of each PE. Since these ports are single, each PE connected to
[Figure A.3 shows the PE components: decode and execute stages for the SU and LU, for multiple ALUs, MAUs, and DSUs (units 0 to n), together with the SP data, register file, shared memory, and PostCND unit.]
Figure A.3: Interconnection of components inside a PE of the extended ManArray architecture [3]
these ports sees the same information at every clock cycle. The instruction port
set associated with the EP is dispatched from IR2 of the SP and is multiplied by
the number of PEs, so that each of the PEs can have an independent array of in-
structions issued at every clock cycle, which is necessary when executing VLIW
instructions.
Each PE has only one Load Unit (LU) and one Store Unit (SU), while there are
multiple instances of ALUs, MAUs, and DSUs, which all have separate ports
from IR2 of the SP. There is a separate instruction signal from each of the pipe-
line modes to the Cluster Switch (CS). The last port of the SP is used for signal-
ling control-flow information to the decode stage of each unit in each PE.
Configuration Files
The information needed for the simulator to run is mainly the number of PEs
and the mode of the instruction set. There are two modes to choose from:
the original instruction set for the BOPS ManArray, and the extended instruc-
tion mode which allows multiple execution units inside the PEs. A main con-
figuration file holds the information needed for the simulation. The file is named
config_a.txt and resides in the same location as the executable file of the
simulator. An example of a configuration file follows:
3# NUM_PE - number of PEs
1# NUM_ALU - number of ALUs
1# NUM_MAU - number of MAUs
1# NUM_DSU - number of DSUs
0# MULTIPLE UNIT PE
-1# EOF
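A minimal sketch of how this format could be parsed follows; the struct and function names are assumed and not taken from the DSP-PP source. Each line starts with an integer value followed by '#' and a comment, and -1 marks the end of the file.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Parsed configuration; field names are our assumption.
struct Config {
    int num_pe = 0, num_alu = 0, num_mau = 0, num_dsu = 0, multi_unit = 0;
};

// Read the five values in file order. std::stoi stops at the '#'
// separator, so the trailing comment is ignored. Assumes well-formed
// input; -1 is treated as an early end-of-file marker.
Config parse_config(std::istream& in) {
    Config cfg;
    int* fields[] = {&cfg.num_pe, &cfg.num_alu, &cfg.num_mau,
                     &cfg.num_dsu, &cfg.multi_unit};
    std::string line;
    for (int* f : fields) {
        if (!std::getline(in, line)) break;
        int v = std::stoi(line);
        if (v == -1) break;          // EOF marker reached early
        *f = v;
    }
    return cfg;
}
```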
Output Files
As the simulator executes a program, the access counts for each counter are
stored in files corresponding to each of the PEs. These files reside in the folder
'Stats' and are named stat pe*.csv, where the star is replaced with the number
of the PE the file represents. There will be as many files as there are PEs in
the simulation. These files are overwritten every time the simulator runs; to
save interesting results, the user must copy these files after each simulation to
avoid loss of data.
The format of the files is semicolon-separated lists. Each row represents one
clock cycle of the simulation and each column represents one access count. The
first row contains the names of the counters. These files can easily be opened
and manipulated with spreadsheet programs such as Microsoft Excel.
Graphical User Interface
The graphical user interface (GUI) is very simple and offers some possibilities
for interaction. Two main actions can be invoked through the GUI: stepping
one clock cycle through the program code, and running the simulator without
interruption. There is also a pull-down menu at the top which lets the user
choose the PE whose information is to be displayed in the PE-related areas of
the GUI (see Fig. A.4).
The GUI is divided into several areas showing different kinds of informa-
tion related to the program execution. Two areas show the pipeline state: one
shows the hexadecimal values of the instructions and the other shows the in-
struction mnemonics. These areas show information related to the selected PE
only. Below them are areas for the different memories, including the VIM,
Special Purpose Registers, Register File, and RAM. At the bottom there are
two areas listing the Access Count variables, one for the SP and one for the
selected PE. For every clock cycle, the values that have changed are marked
with square brackets to make tracing easier. Differences are also marked when
the user switches between the different PEs so that they are easy to spot [3].
Figure A.4: The GUI of our implemented DSP-PP simulator
Current Limitations of Simulator
The main current limitation of the simulator is the incomplete implementation
of the instruction set. About 80% of the ManArray instructions have been im-
plemented; these are the most frequently used ones, and all the basic function-
ality is in place. The implementation of the remaining instructions is rather
straightforward. Another limitation is the lack of support for DMA and inter-
rupts [3].
Bibliography
[1] M. Q. Do, L. Bengtsson, and P. Larsson-Edefors, “Models for Power Consumption
Estimation in the DSP-PP Simulator,” in Proceedings of the International Signal
Processing Conference (ISPC03), Apr. 2003.
[2] M. Q. Do and L. Bengtsson, "Analytical Models for Power Consumption Estima-
tion in the DSP-PP Simulator: Problems and Solutions," Tech. Rep. 03-22, Depart-
ment of Computer Engineering, Chalmers University of Technology, Göteborg,
Sweden, 2003.
[3] F. Milh, "Implementation of the DSP-PP Cycle-True Simulator Using SystemC,"
M.S. thesis, Department of Computer Science and Engineering, Chalmers Uni-
versity of Technology, Göteborg, Sweden, 2004.