
Page 1: Scalable Hardware Mechanisms for Superscalar Processors (engineering.uci.edu/~swallace/papers_wallace/pdf/...)

UNIVERSITY OF CALIFORNIA

Irvine

Scalable Hardware Mechanisms for Superscalar Processors

A dissertation submitted in partial satisfaction of the

requirements for the degree Doctor of Philosophy

in Electrical and Computer Engineering

by

Steven Daniel Wallace

Committee in charge:

Professor Nader Bagherzadeh, Chair

Professor Nikil Dutt

Professor Fadi Kurdahi

1997


© 1997

STEVEN DANIEL WALLACE

ALL RIGHTS RESERVED


The dissertation of Steven Daniel Wallace is approved

and is acceptable in quality and form for

publication on microfilm:

Committee Chair

University of California, Irvine

1997


Dedication

To my parents,for their never-ending love and support.


Contents

List of Figures . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . ix
Table of Symbols . . . . . . . . . . . . . . . . . . . . x
Acknowledgements . . . . . . . . . . . . . . . . . . . . xii
Curriculum Vitae . . . . . . . . . . . . . . . . . . . . xiii
Abstract . . . . . . . . . . . . . . . . . . . . xv

Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . 7
1.2 Organization . . . . . . . . . . . . . . . . . . . . 8

Chapter 2 Background . . . . . . . . . . . . . . . . . . . . 9
2.1 Instruction Fetch Problem . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Fetching Limitation . . . . . . . . . . . . . . . . . . . . 11
2.2 Software Techniques . . . . . . . . . . . . . . . . . . . . 13
2.3 Dynamic Branch Prediction . . . . . . . . . . . . . . . . . . . . 14
2.4 Instruction Fetch Prediction . . . . . . . . . . . . . . . . . . . . 19
2.5 Multiple Block Fetching . . . . . . . . . . . . . . . . . . . . 21
2.6 Register Renaming . . . . . . . . . . . . . . . . . . . . 24
2.6.1 Recovery . . . . . . . . . . . . . . . . . . . . 25
2.7 Register File Complexity . . . . . . . . . . . . . . . . . . . . 26

Chapter 3 Experimental Methodology . . . . . . . . . . . . . . . . . . . . 28
3.1 Simulation Tools . . . . . . . . . . . . . . . . . . . . 29
3.2 SPEC95 Benchmarks . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Program Attributes . . . . . . . . . . . . . . . . . . . . 30
3.3 Machine Model . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Default Configuration . . . . . . . . . . . . . . . . . . . . 35
3.4 Performance Metrics . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Instructions Fetched Per Cycle (IFPC) . . . . . . . . . . . . . . . . . . . . 36
3.4.2 Branch Execution Penalty (BEP) . . . . . . . . . . . . . . . . . . . . 37
3.4.3 Effective Instruction Fetch Rate (IPCf) . . . . . . . . . . . . . . . . . . . . 37
3.4.4 Instructions Per Cycle (IPC) . . . . . . . . . . . . . . . . . . . . 38


Chapter 4 Instruction Fetching Mechanisms . . . . . . . . . . . . . . . . . . . . 39
4.1 Fetching Model . . . . . . . . . . . . . . . . . . . . 39
4.2 Hardware Techniques . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Simple Cache . . . . . . . . . . . . . . . . . . . . 41
4.2.2 Extended Cache . . . . . . . . . . . . . . . . . . . . 42
4.2.3 Self-Aligned Cache . . . . . . . . . . . . . . . . . . . . 43
4.2.4 Prefetching . . . . . . . . . . . . . . . . . . . . 44
4.2.5 Dual Branch Target Buffer . . . . . . . . . . . . . . . . . . . . 45
4.3 Expected Instruction Fetch . . . . . . . . . . . . . . . . . . . . 48
4.3.1 Simple Cache . . . . . . . . . . . . . . . . . . . . 48
4.3.2 Extended Cache . . . . . . . . . . . . . . . . . . . . 49
4.3.3 Self-aligned Cache . . . . . . . . . . . . . . . . . . . . 50
4.3.4 Prefetching . . . . . . . . . . . . . . . . . . . . 50
4.3.5 Dual Block Fetching . . . . . . . . . . . . . . . . . . . . 52
4.3.6 Evaluation . . . . . . . . . . . . . . . . . . . . 53
4.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . 60

Chapter 5 Multiple Branch and Block Prediction . . . . . . . . . . . . . . . . . . . . 66
5.1 Multiple Branch Prediction . . . . . . . . . . . . . . . . . . . . 67
5.2 Dual Block Prediction . . . . . . . . . . . . . . . . . . . . 76
5.2.1 Single Selection . . . . . . . . . . . . . . . . . . . . 77
5.2.2 Double Selection . . . . . . . . . . . . . . . . . . . . 81
5.2.3 Misprediction . . . . . . . . . . . . . . . . . . . . 82
5.3 Performance . . . . . . . . . . . . . . . . . . . . 87
5.3.1 Conditional Branch Accuracy . . . . . . . . . . . . . . . . . . . . 89
5.3.2 Block Information Type . . . . . . . . . . . . . . . . . . . . 91
5.3.3 Single vs. Double Selection . . . . . . . . . . . . . . . . . . . . 91
5.3.4 Target Arrays . . . . . . . . . . . . . . . . . . . . 94
5.3.5 Instruction Cache Configurations . . . . . . . . . . . . . . . . . . . . 94
5.3.6 Prefetching . . . . . . . . . . . . . . . . . . . . 97
5.4 Multiple Block Prediction . . . . . . . . . . . . . . . . . . . . 102
5.5 Cost Estimates . . . . . . . . . . . . . . . . . . . . 105
5.5.1 Storage . . . . . . . . . . . . . . . . . . . . 105
5.5.2 Timing . . . . . . . . . . . . . . . . . . . . 109

Chapter 6 Scalable Register File . . . . . . . . . . . . . . . . . . . . 117
6.1 Register Renaming . . . . . . . . . . . . . . . . . . . . 117
6.1.1 Recovery . . . . . . . . . . . . . . . . . . . . 118
6.1.2 CAM/Table Hybrid . . . . . . . . . . . . . . . . . . . . 119
6.1.3 Intrablock Decoding . . . . . . . . . . . . . . . . . . . . 121
6.2 Register File Utilization . . . . . . . . . . . . . . . . . . . . 123
6.3 Dynamic Result Renaming . . . . . . . . . . . . . . . . . . . . 124
6.3.1 Deadlocks . . . . . . . . . . . . . . . . . . . . 129


6.4 Implementation . . . . . . . . . . . . . . . . . . . . 130
6.4.1 Source Operand Renaming . . . . . . . . . . . . . . . . . . . . 130
6.4.2 Destination Operand Renaming . . . . . . . . . . . . . . . . . . . . 133
6.5 Performance . . . . . . . . . . . . . . . . . . . . 136

Chapter 7 Conclusion . . . . . . . . . . . . . . . . . . . . 143
Chapter 8 Future Directions . . . . . . . . . . . . . . . . . . . . 148
Bibliography . . . . . . . . . . . . . . . . . . . . 151


List of Figures

1.1 Current Superscalar Cost and Performance Trends . . . . . . . . . . 5
1.2 Superscalar Cost and Performance Goals . . . . . . . . . . 5
2.1 Pipeline Stages of a Superscalar Processor . . . . . . . . . . 10
2.2 Simple Fetching Example . . . . . . . . . . 11
2.3 Pattern History Table and 2-bit Counter State Diagram . . . . . . . . . . 15
2.4 2-Level Adaptive Branch Prediction . . . . . . . . . . 16
2.5 Global History Adaptive Branch Prediction . . . . . . . . . . 17
2.6 Per-Addr History Adaptive Branch Prediction . . . . . . . . . . 18
2.7 Block Diagram Schematic of the NLS Architecture . . . . . . . . . . 20
2.8 Multiple Global Adaptive Branch Prediction . . . . . . . . . . 22
2.9 Branch Address Tree and Cache Mapping . . . . . . . . . . 23
2.10 Two-block Ahead Branch Prediction . . . . . . . . . . 23
2.11 Block Diagram of Renaming Logic . . . . . . . . . . 25
4.1 Fetching Block Diagram . . . . . . . . . . 40
4.2 Extended Fetching Example . . . . . . . . . . 43
4.3 Self-aligned Fetching Example . . . . . . . . . . 44
4.4 Prefetch Example . . . . . . . . . . 45
4.5 Block Diagram of Dual Branch Target Buffer Entry . . . . . . . . . . 47
4.6 Dual Branch Target Buffer Example . . . . . . . . . . 48
4.7 Prefetch Buffer State Diagram . . . . . . . . . . 51
4.8 Expected Instruction Fetch without Prefetching . . . . . . . . . . 54
4.9 Self-Aligned Expected Instruction Fetch with Prefetching (n � �) . . . . . . . . . . 55
4.10 Self-Aligned Expected Instruction Fetch with Prefetching (n � �) . . . . . . . . . . 56
4.11 Simple Expected Instruction Fetch with Prefetching . . . . . . . . . . 57
4.12 Different Cache Techniques with Prefetching . . . . . . . . . . 58
4.13 Different Cache Techniques for Dual Block Fetching with Prefetching . . . . . . . . . . 59
5.1 Multiple Global Adaptive Branch Prediction Example . . . . . . . . . . 69
5.2 Multiple Branch Prediction with Blocked PHT Example . . . . . . . . . . 70
5.3 Block Diagram of a Multiple Branch Prediction Fetching Mechanism . . . . . . . . . . 72
5.4 Branch Selection Logic . . . . . . . . . . 74
5.5 Block Diagram for Dual Block Prediction . . . . . . . . . . 79
5.6 Pipeline Stage Diagram for Dual Block Prediction . . . . . . . . . . 80
5.7 Block Diagram for Dual Block Prediction Using Double Selection . . . . . . . . . . 83


5.8 Pipeline Stage Diagram for Dual Block Prediction Using Double Selection . . . . . . . . . . 84
5.9 Branch Misprediction Rate and Improvement . . . . . . . . . . 90
5.10 Block Information Type Penalty and Performance . . . . . . . . . . 92
5.11 Single and Double Selection Performance . . . . . . . . . . 93
5.12 Branch Execution Penalties for Dual Block, Single Selection . . . . . . . . . . 98
5.13 Branch Execution Penalties for Dual Block, Double Selection . . . . . . . . . . 100
5.14 Predicting Multiple Blocks . . . . . . . . . . 103
5.15 Effective Instruction Fetch for Different Block Prediction Capability . . . . . . . . . . 104
5.16 Hardware Storage Cost of Prediction for Different Cache Sizes . . . . . . . . . . 108
5.17 Hardware Storage Cost of Dual Block Prediction for Single and Double Selection . . . . . . . . . . 110
5.18 Timing Chart for 8 KB Instruction Cache Using Dual Block Prediction with Single Selection . . . . . . . . . . 112
5.19 Timing Chart for Pipelined 32 KB Instruction Cache Using Dual Block Prediction with Double Selection . . . . . . . . . . 116
6.1 Block Diagram of Hybrid Renaming . . . . . . . . . . 120
6.2 Dependence Distance for SDSP/SPARC . . . . . . . . . . 122
6.3 Block Diagram of Scalable Register File . . . . . . . . . . 126
6.4 Register File Performance Comparison for a 4-way Superscalar . . . . . . . . . . 137
6.5 Register File Performance Comparison for an 8-way Superscalar . . . . . . . . . . 138
6.6 BIPS and Cycle Time Performance Comparison for a 4-way Superscalar . . . . . . . . . . 141
6.7 BIPS and Cycle Time Performance Comparison for an 8-way Superscalar . . . . . . . . . . 142


List of Tables

3.1 Description of SPEC95 Applications . . . . . . . . . . 31
3.2 Branch Attributes of SPEC95 Applications . . . . . . . . . . 33
3.3 Functional Unit Quantity, Type, and Latency . . . . . . . . . . 34
4.1 Expected Instruction Fetch . . . . . . . . . . 53
4.2 Instructions Fetched per Cycle (n � �) . . . . . . . . . . 61
4.3 Instructions Fetched per Cycle with Prefetching (n � �) . . . . . . . . . . 62
4.4 Instructions Fetched per Cycle with Prefetching (n � �) . . . . . . . . . . 63
4.5 IPB and IFPC for Dual Block Fetching with Prefetching . . . . . . . . . . 65
5.1 Block Information Types and Prediction Sources . . . . . . . . . . 73
5.2 Next Line Prediction Example Based on Starting Position . . . . . . . . . . 75
5.3 Misprediction Penalties . . . . . . . . . . 85
5.4 Bad Branch Recovery Entry . . . . . . . . . . 86
5.5 Indirect and Immediate Misfetch Penalty Comparison for Different Target Array Configurations . . . . . . . . . . 95
5.6 IPB and IPCf for Different Cache Types . . . . . . . . . . 96
5.7 BEP Distribution, IPB, and IPCf for Dual Block, Single Selection . . . . . . . . . . 99
5.8 BEP Distribution, IPB, and IPCf for Dual Block, Double Selection . . . . . . . . . . 101
5.9 Two-block Prediction with Prefetching for Different Decode Sizes . . . . . . . . . . 102
5.10 Simplified Hardware Cost Estimates . . . . . . . . . . 106
5.11 Access Time Estimates (ns) . . . . . . . . . . 111
6.1 Bad Branch Penalty and Performance . . . . . . . . . . 121
6.2 Average Register File Utilization per Cycle . . . . . . . . . . 124
6.3 Read Operand Category Distribution (%) . . . . . . . . . . 128
6.4 CAM Lookup Fields . . . . . . . . . . 131
6.5 Mapping Table Fields . . . . . . . . . . 132
6.6 Recovery List Fields . . . . . . . . . . 134


Table of Symbols

Parameters
L   Number of logical registers
N   Order of superscalar – the maximum issue rate
P   Number of ports in a RAM cell
R   Number of physical registers
S   Maximum number of speculative instructions allowed in pipeline
b   Probability an instruction transfers control
m   Extended cache line size
n   Maximum number of instructions in a decode block
p   Size of prefetch buffer
q   Maximum number of instructions in a fetch block

Functions
Ei  Probability the starting address in the block is at position i
F   Expected instruction fetch per cycle
Ii  Probability exactly i instructions are fetched
Li  Probability a control transfer occurs at position i in a block
Pi  Probability the prefetch buffer contains i instructions
c   Probability of a control transfer in a block
r   Expected block run length

Acronyms
ALU   Arithmetic logical unit
BAC   Branch address cache
BBR   Bad branch recovery
BEP   Branch execution penalty
BHR   Branch history register
BIPS  Billions of instructions per second
BIT   Block information type
BTB   Branch target buffer
CAM   Content addressable memory


Acronyms (continued)
DBTB  Dual branch target buffer
DS    Double selection
FIFO  First-in first-out
GHR   Global history register
IFPC  Instructions fetched per cycle
IPB   Instructions per block
IPC   Instructions per cycle
IPCf  Effective instruction fetching rate
IPFQ  Instructions per fetch request
IW    Instruction window
LRU   Least recently used
NLS   Next line and set
PC    Program counter
PHT   Pattern history table
RAM   Random access memory
RAS   Return address stack
RF    Register file
SMT   Simultaneous multithreading
SS    Single selection
ST    Select table


Acknowledgements

I give special thanks to Professor Nader Bagherzadeh, my advisor, for his guidance and support.

I thank my dissertation committee members, Professor Nikil Dutt and Professor Fadi Kurdahi, for their reading and evaluation of my dissertation work.

I thank all members of the computer architecture research group. I enjoyed working with fellow graduate students from the Advanced Computer Architecture Laboratory and the Fault-Tolerant Multicomputer Laboratory: Marcelo Moraes de Azevedo, Bill Brown, Nirav Dagli, Manu Gulati, Joao Lacerda, Hung Liu, Nayla Nassif, Jesse Pan, Brian Park, Mark Pontius, Simin Shoari, Koji Suginuma, and Honge Wang.

I also appreciate the financial support I received from the Chancellor's Fellowship and a research assistantship.


Curriculum Vitae

Steven Daniel Wallace

1969       Born in Burbank, California
1988       Graduate of Woodbridge High School
1992       B.S. in Electrical Engineering (magna cum laude),
           B.S. in Information and Computer Science (cum laude),
           Minor in Applied Mathematics,
           University of California, Irvine
1992       Chancellor's Fellowship, University of California, Irvine
1993       M.S. in Engineering, University of California, Irvine
1993–1996  Research Assistant, Department of Electrical and Computer Engineering,
           University of California, Irvine
1996       Ph.D. in Electrical and Computer Engineering, with concentration in
           Computer Engineering, University of California, Irvine
           Dissertation: Scalable Hardware Mechanisms for Superscalar Processors

Publications

S. Wallace and N. Bagherzadeh, “Multiple Block and Branch Prediction,” Third International Symposium on High-Performance Computer Architecture, February 1997.

S. Wallace and N. Bagherzadeh, “A Scalable Register File Architecture for Dynamically Scheduled Processors,” 1996 Conference on Parallel Architectures and Compilation Techniques, October 1996.

S. Wallace and N. Bagherzadeh, “Instruction Fetching Mechanisms for Superscalar Microprocessors,” Euro-Par ’96, August 1996.

S. Wallace and N. Bagherzadeh, “Performance Issues of a Superscalar Microprocessor,” Microprocessors and Microsystems, May 1995.

S. Wallace and N. Bagherzadeh, “Performance Issues of a Superscalar Microprocessor,” 23rd International Conference on Parallel Processing, August 1994.

S. Wallace, “Performance Analysis of a Superscalar Architecture,” Master’s thesis, University of California, Irvine, September 1993.


Superscalar Design Papers

S. Wallace, N. Dagli, and N. Bagherzadeh, “Design and Implementation of a 100 MHz Centralized Instruction Window for a Superscalar Microprocessor,” 1995 International Conference on Computer Design, October 1995.

S. Wallace, N. Dagli, and N. Bagherzadeh, “Design and Implementation of a 100 MHz Reorder Buffer,” 37th Midwest Symposium on Circuits and Systems, August 1994.

J. Lenell, S. Wallace, and N. Bagherzadeh, “A 20 MHz CMOS Reorder Buffer for a Superscalar Microprocessor,” 4th NASA Symposium on VLSI Design, October 1992.

Physics Papers

K. Moe, M. Moe, and S. Wallace, “Drag Coefficients of Spheres in Free-molecular Flow,” 1996 AAS/AIAA Space Flight Mechanics Meeting, Austin, Texas, February 1996.

M. Moe, S. Wallace, and K. Moe, “Recommended Drag Coefficients for Aeronomic Satellites,” The Upper Mesosphere and Lower Thermosphere: A Review of Experiment and Theory, Geophysical Monograph 87, American Geophysical Union, pp. 349–356, 1995.

M. Moe, S. Wallace, and K. Moe, “Refinements in Determining Satellite Drag Coefficients: Method for Resolving Density Discrepancies,” Journal of Guidance, Control, and Dynamics, June 1993.

M. Moe, S. Wallace, and K. Moe, “Recommended Drag Coefficients for Aeronomic Satellites,” Chapman Conference, Asilomar, November 1992.


Abstract of the Dissertation

Scalable Hardware Mechanisms for Superscalar Processors

by

Steven Daniel Wallace

Doctor of Philosophy in Electrical and Computer Engineering

University of California, Irvine, 1997

Professor Nader Bagherzadeh, Chair

Superscalar processors fetch and execute multiple instructions per cycle. As more instructions can be executed per cycle, an accurate, high-bandwidth instruction fetching mechanism becomes increasingly important to performance. This dissertation describes and analyzes instruction fetching mechanisms using three different cache types: a simple cache, an extended cache, and a self-aligned cache. A mathematical model is developed for each cache technique, and performance is evaluated both in theory and in simulation using the SPEC95 suite of benchmarks. In all the techniques, the fetching performance is dramatically lower than ideal expectations. Prefetching can be used to increase performance. Nevertheless, single-block fetching performance is fundamentally limited by control transfers. Thus, to overcome this limitation, multiple blocks must be fetched in a single cycle.

Accurate branch prediction and instruction fetch prediction are also critical to achieving high performance. To achieve a high fetching rate for wide-issue superscalar processors, a scalable method to predict multiple branches per block of sequential instructions is presented. Its accuracy is equivalent to that of scalar two-level adaptive prediction. Also, to overcome the limitation imposed by control transfers, a scalable method to predict multiple blocks is introduced. Results demonstrate that a two-block, multiple branch prediction mechanism with a block width of eight instructions achieves an effective fetching rate of eight instructions per cycle.

A major obstacle in designing superscalar processors is the size and port requirement of the register file. Multiple scalar register files can be used if results are renamed when they are written to the register file. Consequently, a scalable register file architecture can be implemented without performance degradation. Another benefit is that the cycle time of the register file is significantly shortened, potentially producing a tremendous increase in the speed of the processor.


Chapter 1

Introduction

The goal of a superscalar microprocessor is to execute multiple instructions per cycle. Instruction-level parallelism (ILP) available in programs can be exploited to realize this goal [19]. Depending on the types of programs and the assumptions used, researchers have shown that parallelism anywhere from 4 to 90 is available [22, 32, 28, 41]. The potential speedup from program parallelism may not be realized if the processor is unable to take full advantage of it. Instruction fetch efficiency, branch prediction, instruction and data caches, resource allocation, decode width, and issue width are hardware factors that determine the overall performance of a processor. Instruction fetch efficiency has been shown to be the greatest factor limiting speedup [43].

For example, an 8-way superscalar processor with simple fetching hardware can expect to fetch fewer than four instructions per cycle, even with perfect branch prediction. This accounts for over 50% of the loss in potential speedup, regardless of any other performance issues. Thus, performance is severely reduced even if the ILP in the program and the execution pipeline would allow eight instructions to execute per cycle.
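The arithmetic behind this claim can be sketched with a simple probabilistic model. The model below is an illustrative assumption on my part, not the dissertation's own derivation (Chapter 4 develops the full expected-fetch model): fetching starts at a uniformly random position within an n-instruction cache line, each instruction is a taken control transfer with independent probability b, and fetching stops at the first taken transfer or at the end of the line.

```python
def expected_fetch_simple(n, b):
    """Expected instructions fetched per cycle from a simple cache:
    n = instructions per cache line, b = probability an instruction is a
    taken control transfer (assumed independent per instruction)."""
    total = 0.0
    for start in range(n):
        room = n - start          # instructions left before the line boundary
        # E[fetched] = sum_{k=1..room} P(first k-1 instructions fall through)
        #            = (1 - (1-b)^room) / b   (partial geometric series)
        total += (1.0 - (1.0 - b) ** room) / b
    return total / n

# With roughly one taken control transfer per 6-7 instructions (b ~ 0.15),
# an 8-instruction line yields only about 3.2 instructions per cycle.
print(round(expected_fetch_simple(8, 0.15), 2))
```

Under this toy model, widening the line helps only modestly, since taken transfers rather than the line boundary become the binding limit; that is the motivation for fetching multiple blocks per cycle.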

The underlying problem in fetching instructions on a control-flow architecture is branches. To begin with, conditional branches create uncertainty in the flow of control, which can cause severe performance penalties if they are not accurately predicted. Even with perfect dynamic branch prediction, conditional and unconditional branches disrupt the sequential addressing of instructions. This non-sequential accessing of instructions makes fetching instructions in hardware difficult. As a result, the instruction fetcher restricts the amount of concurrency available to the processor [37].

Branch prediction foretells the outcome of conditional branch instructions: it predicts the direction of a conditional branch, taken or not taken. The target addresses of indirect branches and return instructions also need to be predicted, because they are usually determined late in the pipeline. Without branch prediction, when a branch instruction is encountered, instruction fetching must stall until the branch's direction is calculated before proceeding to the next instruction. Using branch prediction, however, a processor can continue fetching and speculatively execute instructions past the branch. If the prediction is incorrect, then all the speculative work must be nullified and instruction fetching restarted at the correct address. The resulting pipeline bubbles are called the misprediction penalty. In addition to predicting a branch's direction, the target addresses of taken branches need to be predicted through instruction fetch prediction, which predicts which instructions to fetch from the instruction cache when there is a branch [8].
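A common building block for direction prediction is the 2-bit saturating counter (its state diagram appears later, in Figure 2.3): two wrong outcomes in a row are needed to flip the prediction, so a loop branch is mispredicted only at its exit. A minimal sketch follows; the initial weakly-taken state and the loop-branch trace are assumptions made for illustration.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken."""
    def __init__(self, state=2):              # start weakly taken (assumed)
        self.state = state

    def predict(self):
        return self.state >= 2                # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

c = TwoBitCounter()
outcomes = [True] * 8 + [False] + [True] * 8  # a loop branch with one exit
correct = 0
for taken in outcomes:
    correct += (c.predict() == taken)         # predict, then learn the outcome
    c.update(taken)
print(correct, len(outcomes))                 # 16 of 17: only the exit misses
```

The hysteresis is the point of the design: a single not-taken outcome does not flip the counter out of its taken-predicting half, so the next loop iteration is still predicted correctly.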

After it has been determined which instructions to fetch, the instruction fetch mechanism reads the instructions from the instruction cache and delivers them to the decoder. The fetch mechanism may not be able to fetch all the desired instructions, which limits instruction fetch efficiency and overall performance. Potential parallelism from ILP cannot be utilized when instructions are not delivered for decoding and execution at a sufficient rate. A high-performance fetching mechanism is required.

Extensive research in branch prediction and instruction fetch prediction has been done for scalar processors. Yeh introduced two-level adaptive branch prediction, which uses the history of previous branches to index into a Pattern History Table (PHT); he reports 97% branch prediction accuracy [53]. Calder proposed the Next Line and Set (NLS) scheme, which predicts the next instruction cache line and set to fetch [8]. Both the PHT and NLS provide excellent branch prediction and instruction fetch prediction for scalar processors, but it is not clear how to scale these techniques for a superscalar processor.
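The two-level idea can be sketched in a few lines. This is an illustrative toy, not Yeh's exact configuration (the history length and PHT organization here are my assumptions): a global history register of recent outcomes selects one of several 2-bit counters, so a repeating pattern that defeats a single counter is learned per history.

```python
class TwoLevelPredictor:
    """Sketch of global two-level adaptive branch prediction: a global
    history register (GHR) of the last few outcomes indexes a pattern
    history table (PHT) of 2-bit saturating counters."""
    def __init__(self, history_bits=4):
        self.h = history_bits
        self.ghr = 0                              # global history register
        self.pht = [2] * (1 << history_bits)      # counters start weakly taken

    def predict(self):
        return self.pht[self.ghr] >= 2            # True = predict taken

    def update(self, taken):
        c = self.pht[self.ghr]
        self.pht[self.ghr] = min(3, c + 1) if taken else max(0, c - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.h) - 1)

# A taken/taken/not-taken pattern defeats a lone 2-bit counter but is
# learned by the history-indexed PHT after a brief warm-up.
p = TwoLevelPredictor()
hits = 0
for taken in [True, True, False] * 40:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)   # near-perfect: only a couple of warm-up misses out of 120
```

Each distinct history pattern trains its own counter, which is what makes the scheme adaptive; the scaling question the dissertation takes up is how to obtain several such predictions per cycle.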

The register file is a design obstacle for superscalar microprocessors. If N instructions can be issued in a cycle, then a superscalar microprocessor's register file needs 2N read ports and N write ports to handle the worst-case scenario. The area complexity of the register file grows in proportion to N² [10]. Therefore, a new architecture is needed to keep the number of ports per register file cell constant as N increases. In addition, the register requirements for high performance and exception handling can be quite high. Farkas et al. conclude that for best performance, 160 registers are needed for a four-way issue machine and 256 registers for an eight-way issue machine [15]. Therefore, it is desirable to reduce the register requirements while still maintaining performance.
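The port arithmetic is easy to check. The sketch below uses the usual worst-case assumption of two source reads and one result write per instruction; the quadratic area growth follows because a RAM cell needs roughly one word line and one bit line per port, so cell area grows with the square of the total port count.

```python
def rf_ports(n):
    """Worst-case register-file ports for an n-way issue machine,
    assuming each instruction reads two sources and writes one result."""
    return 2 * n, n                      # (read ports, write ports)

for n in (1, 2, 4, 8):
    reads, writes = rf_ports(n)
    ports = reads + writes               # total ports grow linearly in n
    rel_area = (ports * ports) // 9      # cell area relative to a 1-way machine
    print(n, reads, writes, rel_area)    # relative area comes out to n^2
```

For n = 8 this gives the 16-read, 8-write configuration mentioned below and a per-cell area 64 times that of a single-issue machine, which is the N² trend the text describes.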

A major difficulty in the simultaneous multithreading (SMT) architecture, introduced by Tullsen et al., was the size of the register file [40]. They supported eight threads on an eight-way issue machine, using 356 total registers with 16 read ports and 8 write ports. Compared to the standard 32-register, 2-read-port, 1-write-port register file of a scalar processor, the area of the SMT register file is estimated to be over seven hundred times larger. To accommodate the size of the register file, they took two cycles to read registers instead of one. This underscores the need for a mechanism that scales the register file yet still has the area and access-time benefits of a scalar processor's register file.
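That estimate can be sanity-checked with a first-order area model. The model is my assumption (register count times the square of the total port count); the exact model behind the figure in the text may differ, but the numbers line up.

```python
def rf_area(registers, read_ports, write_ports):
    """Relative register-file area under a first-order model: cell area
    grows with the square of the total port count (one word line and one
    bit line per port), multiplied by the number of registers."""
    ports = read_ports + write_ports
    return registers * ports * ports

smt = rf_area(356, 16, 8)       # SMT register file from Tullsen et al.
scalar = rf_area(32, 2, 1)      # conventional scalar register file
print(smt / scalar)             # 712.0, i.e. "over seven hundred times larger"
```

The ratio 356/32 in registers contributes about 11x and the (24/3)² ratio in ports contributes 64x, which together give the roughly 700x figure.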

Although current superscalar microprocessors are attempting to execute more in-

structions per cycle than previous generations, the increase in performance has not been

Page 21: Scalable Hardware Mechanisms for Superscalar Processorsengineering.uci.edu/~swallace/papers_wallace/pdf/...The dissertation of Steven Daniel Wallace is approved and is acceptable in

4

substantial and the cost of implementation has greatly increased [38, 52, 1]. This trend

in cost and performance is displayed in Figure 1.1. The area of the processor increases

proportionally to the square of the order of superscalar. In addition, the performance has

dropped off far from ideal speedup. Although ideal speedup is not reasonable, a cost and

performance closer to ideal is desirable, as shown in Figure 1.2. The cost should increase

proportionally to the order of superscalar, N. In addition, the performance should be

able to continue to increase (given enough ILP) as N increases. It should not be limited by

instruction fetching.

This dissertation introduces instruction fetching mechanisms that are scalable in cost

and performance and do not limit instruction fetching. In addition, a scalable organization

of a register file for a superscalar microprocessor is described. The general cost and per-

formance objectives of Figure 1.2 are followed for the hardware mechanisms introduced.

Although every aspect of the instruction fetching mechanisms and register files may not be

scalable, the major storage elements are designed to be scalable yet still retain excellent

performance. The following paragraphs summarize these hardware mechanisms.

In this dissertation, different types of instruction fetching mechanisms are described

and modeled. Three different instruction cache configurations are considered: a simple

cache type, an extended cache type, and a self-aligned cache type. A simple cache uses

a one-to-one mapping between the instruction cache line and the decoder. Unfortunately,

once a control transfer instruction is encountered, remaining instructions from the cache

line must be invalidated. In addition, if the target address of a control transfer instruction is

in the middle of a cache line, previous instructions in that line can not be used. An extended

cache uses a cache line size greater than the maximum number of instructions allowed by

the decoder. This reduces the chance of a limited block size from targets that jump into

[Figure 1.1: Current Superscalar Cost and Performance Trends (size, performance, and ideal speedup relative to a scalar processor, versus the order of superscalar N)]

[Figure 1.2: Superscalar Cost and Performance Goals (size, performance, and ideal speedup relative to a scalar processor, versus the order of superscalar N)]

the middle of a cache line. To completely solve the problem with target addresses, a self-aligned
cache is presented. Unfortunately, all of these cache types are limited by control

transfers.

To approach the upper bound of single block fetching, a form of prefetching can

be used. Instead of limiting fetching to N instructions from a cache line, this number

is increased and extra instructions are put into a prefetch buffer. As a result, when fewer

than N instructions are retrieved from a cache line, extra instructions previously fetched

can provide the remaining instructions to deliver N instructions to the decoder. In order

to improve fetching beyond what prefetching can accomplish, two cache lines need to be

fetched per cycle. A new fetching mechanism, called the dual branch target buffer, can

predict the addresses for the next two lines. Hence, a two-block fetching mechanism can

increase fetching capability beyond the limitation of one-block fetching and satisfy ever

increasing fetching demands by wide-issue superscalar processors.

The theory behind the fetching techniques gives insight into fetching problems. For

that reason, a probabilistic model based on the probability of a control transfer is developed

for all fetching technique combinations. Given certain program characteristics and fetching

mechanism parameters, the expected performance can be calculated. To demonstrate the

accuracy of the fetching models, they are evaluated under several different conditions and

compared to simulations.

Although the dual branch target buffer is able to fetch two blocks per cycle, its in-

struction fetch accuracy is not as good as scalar methods. In order to achieve extremely

high fetching rates, it is necessary to accurately predict multiple branches per block. The

accuracy of a scalar two-level adaptive branch prediction can be retained by using a blocked


pattern history table suitable for multiple branch prediction. The difficulty arises in retain-

ing this accuracy while predicting multiple blocks per cycle. Essentially, the solution to this

problem is to predict the prediction. The prediction for additional blocks uses the predic-

tion made previously with the same branch history. Hence, high-performance instruction

fetching is possible under realistic conditions.

Lastly, the problem of scalability of a register file is attacked in two directions. First,

by using a multiple banked register file, it is possible to reduce the port requirement to that

of a scalar processor: 2 read ports and 1 write port. This is accomplished by dynamically

renaming registers at result write time instead of at decode time. Second, by improving

the utilization of the registers, the number of registers may be reduced. Both of these

factors can dramatically reduce the area and cycle time of the register file while maintaining

performance.

1.1 Contributions

This dissertation makes three major contributions:

1. Instruction fetch mechanisms, including three different types of instruction cache

configurations, prefetching, and two block fetching are evaluated. A theoretical

model is presented for each type which is able to accurately determine the expected

instruction fetching performance. The limits of single block fetching are clearly

shown. The potential benefit of a two block fetch mechanism to break this barrier is

demonstrated by introducing a dual branch target buffer.


2. A method to provide scalable multiple branch and block prediction is introduced. The

multiple branch predictor retains the accuracy of a scalar branch predictor. Multiple

blocks can be predicted in parallel each cycle.

3. A scalable register file architecture is introduced. Multiple scalar register files are

used instead of a large multi-ported register file. The number of registers can be

reduced by increasing the utilization of registers. As a result, the area of the register

file is dramatically reduced. Also, the cycle time of the register file is shortened,

which can significantly increase the performance of a processor.

1.2 Organization

The remainder of this dissertation is organized into six chapters. Chapter 2 dis-

cusses background material related to this dissertation. Chapter 3 explains the experimen-

tal methodology used throughout the dissertation, including simulation tools, performance

metrics, benchmark programs, and benchmark characteristics. Chapter 4 describes differ-

ent instruction fetching mechanisms: three instruction cache techniques, prefetching, and

two block fetching. A theoretical model for each mechanism is presented and expected

fetching performance is compared against simulated results. Chapter 5 introduces multiple

branch per block prediction and multiple block prediction. Chapter 6 describes a scalable

register file architecture. Chapter 7 presents the conclusion of this dissertation. Finally,

Chapter 8 gives insight into future directions concerning instruction fetching mechanisms

and register file architectures.


Chapter 2

Background

This chapter provides background information and related work on instruction fetch-

ing and register files. Many mechanisms have been proposed to improve branch predic-

tion and instruction fetch prediction. These include two-level adaptive branch prediction,

branch target buffer, and next line and set prediction. This chapter describes the instruc-

tion fetch problem, instruction fetch limitation, dynamic branch prediction, multiple block

fetching, register renaming, and register file complexity.

2.1 Instruction Fetch Problem

Branch instructions create two basic problems with instruction fetching. First, a con-

ditional branch creates uncertainty in which direction should be taken. By the time a branch

is executed and found to be incorrectly predicted, a superscalar processor may have fetched

dozens of instructions which will have to be thrown away. Second, the target address of

a taken branch may not be known, so it also has to be predicted. In addition, a control

transfer disrupts the sequential accessing of instructions, so this requires a different line in

the instruction cache to be accessed.


For example, consider a superscalar architecture with six pipeline stages: instruction

fetch (IF), instruction decode and register rename (D/R), issue (IS), register read (RR),

execute (EX), and result commit (RC). These stages are depicted in Figure 2.1. If the target

address of a conditional branch is not known or incorrectly predicted, there is a one cycle

misfetch penalty. Furthermore, at least four more stages are required to detect an incorrectly

predicted conditional branch’s direction or an indirect branch’s address. This misprediction

penalty may be greater if a branch instruction spends more than one cycle in the issue stage

or instructions from a branch’s block have to be re-fetched.

[Figure 2.1: Pipeline Stages of a Superscalar Processor (IF, D/R, IS, RR, EX, RC)]

In addition to branch misprediction, control transfers can cause a significant loss in

fetching throughput. Figure 2.2 demonstrates how a straightforward superscalar fetching

technique would handle control transfers. To begin with, the first block of instructions

fetched discards two instructions after a taken branch. The branch transfers control to the

second block, but the starting position is not at the beginning of the block. As a result,


previous instructions in that block must be invalidated. Another control transfer is encoun-

tered in the second block and the remaining instructions are invalidated. Overall, only four

instructions out of a potential eight were fetched.

[Figure 2.2: Simple Fetching Example (Block 0: add, sub, branch, with lost slots after the taken branch; Block 1: call, with lost slots before the starting PC and after the call)]

The simple fetching example demonstrates two fetching problems caused by control

transfers. First, a branch whose target address is not the beginning of a block results in

lost instructions. This is called a branch alignment problem. As will be discussed in

Chapter 4, the branch alignment problem may be completely solved in hardware. Second,

a control transfer instruction stops the sequential accessing of instructions in an instruction

cache line. As a result, a new instruction line must be read. Unlike the branch alignment

problem, this implies a fundamental limitation on the number of instructions that may be

fetched in one block.

2.1.1 Fetching Limitation

Control transfer instructions impose a limitation on instruction fetching. Let n be the
width of a block and b be the probability that an instruction transfers control. The expected
block run length, r(n, b), is

    r(n, b) = n(1-b)^{n} + \sum_{i=1}^{n} i(1-b)^{i-1}b = \frac{1-(1-b)^{n}}{b}.    (2.1)


Equation 2.1 represents the weighted sum of all events that could occur in a sequence of

n instructions. The first term is the case where there is no control transfer in a block. The

second term represents all possible permutations of a control transfer in a block. The limit

of r(n, b) as the block width increases is given by

    \lim_{n \to \infty} r(n, b) = \frac{1}{b}.    (2.2)

If a control transfer requires another cycle to reach the target address, then only one block

of instructions can be fetched in a cycle. Regardless of the type of software scheduling

or hardware techniques used to improve fetching, 1/b is the limit for the average number
of instructions fetched per cycle. Under these conditions, 1/b is the maximum average

number of instructions per cycle that can be executed on any single-threaded control-flow

architecture.

Here is an example to illustrate this fundamental fetching limitation. Suppose a pro-

gram executes a million instructions, and one hundred thousand of these instructions trans-

fer control. The probability of a control transfer instruction is therefore one tenth, and an

average of ten instructions fetched per cycle is the theoretical limit. Since each control

transfer instruction requires one cycle, to execute this program would require a minimum

of one hundred thousand cycles. Assuming no other performance problems, this program

can execute a maximum of ten instructions per cycle.
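The closed form of Equation 2.1 and the 1/b limit of Equation 2.2 can be checked numerically. A minimal sketch using the example's b = 0.1:

```python
def run_length_sum(n, b):
    # Weighted sum of Equation 2.1: either no control transfer occurs
    # in the block (a run of n), or the first transfer is at position i.
    return n * (1 - b) ** n + sum(i * (1 - b) ** (i - 1) * b
                                  for i in range(1, n + 1))

def run_length_closed(n, b):
    # Closed form of Equation 2.1.
    return (1 - (1 - b) ** n) / b

b = 0.1  # one in ten instructions transfers control
for n in (4, 8, 64):
    assert abs(run_length_sum(n, b) - run_length_closed(n, b)) < 1e-9
print(run_length_closed(64, b))  # approaches the 1/b = 10 limit of Eq. 2.2
```

For n = 64 the expected run length is already within about 0.1% of the 1/b = 10 asymptote, illustrating why wider blocks alone cannot break this barrier.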

As a result of this limitation, in order to average greater than 1/b instructions fetched

per cycle, multiple blocks of instructions must be fetched in one cycle. This requires spe-

cial hardware to predict multiple branches per cycle, which will be discussed in detail in

Chapter 5.


2.2 Software Techniques

Although only hardware techniques will be discussed, the potential benefit from software
techniques cannot be ignored. Using software techniques, the probability of a control

transfer instruction can be reduced. Loop unrolling is one method [3]. A relatively new

technique proposed by Calder and Grunwald is most promising [6]. By rearranging ba-

sic blocks, conditional branches become more likely not to be taken. This means that the

probability of a control transfer instruction is reduced because a not-taken branch is not a

control transfer. Nevertheless, software will only be able to make limited improvements.

As will be shown in this dissertation, hardware techniques can boost instruction fetching

performance after software improvements. Furthermore, unlike software techniques, hard-

ware techniques are able to address limitations created by control transfers.

Software techniques can be used to perform static branch prediction, which does

not vary during the execution of a program. One form of static branch prediction uses

compile-time heuristics [4, 24, 26]. Profile-based prediction is another method which usu-

ally performs better than compile-time heuristics [16, 26]. The most common static branch
predictors include predicting backward branches taken and forward branches not taken,
encoding the most likely direction in the branch instruction, and using delay

slots. Static branch prediction can only reliably achieve around 70-80% accuracy. On the

other hand, dynamic branch prediction, which uses run-time information, can accurately

predict over 90% of dynamic branches.


2.3 Dynamic Branch Prediction

In order to achieve a high branch prediction accuracy, most modern microprocessors

use dynamic branch prediction [38, 52, 39, 30]. Dynamic branch prediction uses informa-

tion from previous execution of branches. Therefore, the prediction of a specific branch

may change depending on the run-time behavior.

The simplest form of dynamic branch prediction is a 1-bit predictor which records

if a branch was taken or not-taken the last time it was executed. A 1-bit predictor may be

stored in the instruction cache. Alternatively, a pattern history table (PHT) may be used to

store 1-bit, 2-bit, or N-bit counters. A 2-bit up-down saturating counter has been shown to

be effective [27]. The 2-bit up-down saturating counter state transition diagram is shown

in Figure 2.3. When a branch is taken, the counter is incremented; when it is not taken, the

counter is decremented. The counter does not decrement past 0 or increment past 3. When

a branch is strongly predicted to be taken or not taken, two consecutive mispredictions are

required to change the prediction (it has a “second chance”). This has proven especially

effective for loop conditional branches.
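The counter of Figure 2.3 can be sketched as follows. This is a minimal model; the encoding of states 0-3, with taken predicted for states 2 and 3, is an assumption consistent with the description above:

```python
class TwoBitCounter:
    """2-bit up-down saturating counter (states 0..3)."""
    def __init__(self, state=2):
        self.state = state          # 2 = weakly taken

    def predict(self):
        return self.state >= 2      # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)   # saturate at 3
        else:
            self.state = max(self.state - 1, 0)   # saturate at 0

# "Second chance": from strongly taken (state 3), a single not-taken
# outcome still leaves the prediction taken.
c = TwoBitCounter(state=3)
c.update(False)
print(c.predict())  # True -- one misprediction does not flip the prediction
```

This second-chance behavior is what makes the counter effective for loop branches: the one not-taken exit of a loop does not flip the prediction for the next loop entry.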

The PHT may be directly indexed by the PC, as shown in Figure 2.3. Unfortunately,

considerable interference results with this simple mapping. A more effective use of a PHT

is to use branch-correlation and two-level adaptive prediction mechanisms [29, 55, 56]. It

uses a k-bit branch history register to index into a 2^k-entry PHT, as shown in Figure 2.4. A

history register is updated after each branch. It is shifted to the left one position, and a 1 is

inserted for a taken branch and a 0 is inserted for a not taken branch.
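A minimal sketch of global two-level adaptive prediction, with a k-bit global history register indexing a 2^k-entry PHT of 2-bit counters (the value of k and the training pattern are illustrative):

```python
K = 4                      # history bits (illustrative)
pht = [2] * (1 << K)       # 2^k two-bit counters, initialized weakly taken
ghr = 0                    # global history register

def predict():
    return pht[ghr] >= 2   # True = predict taken

def update(taken):
    global ghr
    idx = ghr
    # 2-bit saturating counter update at the indexed PHT entry.
    pht[idx] = min(pht[idx] + 1, 3) if taken else max(pht[idx] - 1, 0)
    # Shift the outcome into the history register (1 = taken).
    ghr = ((ghr << 1) | int(taken)) & ((1 << K) - 1)

# Train on a strictly alternating pattern; the history-indexed counters
# learn it exactly, which a PC-indexed 2-bit counter alone cannot do.
for i in range(100):
    update(i % 2 == 0)
print(predict())
```

After training, the two history patterns that actually occur (…0101 and …1010) each own a saturated PHT entry, so the alternating branch is predicted correctly every cycle.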

The history register may be a global history register, as in Figure 2.5, or a per-addr

history register, as in Figure 2.6. A global history register (GHR) represents the last k

[Figure 2.3: Pattern History Table and 2-bit Counter State Diagram]

[Figure 2.4: 2-Level Adaptive Branch Prediction]

outcomes of the whole program, while a per-addr branch history register (BHR) represents

the last k outcomes for a specific branch. In addition, as shown in Figures 2.5 and 2.6

the PHT may be a single global table or multiple tables indexed by the branch address.

Furthermore, the BHR and PHT may use a per-set variation [55]. Although Yeh found these
schemes to be accurate, much of the PHT may be left unused, depending on

branch patterns. In order to increase utilization of the PHT and overall accuracy, McFarling

used the exclusive-or of the global history register and the branch address to index the

PHT [25].
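McFarling's indexing amounts to XOR-folding the branch address into the global history. A sketch of the index computation only; the index width and word-aligned PC shift are assumptions:

```python
K = 12                      # PHT index width (illustrative)
MASK = (1 << K) - 1

def gshare_index(pc, ghr):
    # XOR the branch address with the global history so that different
    # branches sharing the same history map to different PHT entries.
    return ((pc >> 2) ^ ghr) & MASK   # >> 2 drops word-aligned low bits

# Two nearby branches with identical history land in different entries.
print(gshare_index(0x4000, 0b1010) != gshare_index(0x4008, 0b1010))
```

A plain history-indexed table would alias these two branches onto the same counter; the XOR spreads them apart and so raises PHT utilization.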

[Figure 2.5: Global History Adaptive Branch Prediction]

[Figure 2.6: Per-Addr History Adaptive Branch Prediction]


2.4 Instruction Fetch Prediction

The two-level adaptive branch prediction only provides the direction of a branch.

Another mechanism is required to predict the target of a branch. For taken conditional

branches and unconditional jumps, a branch target buffer (BTB) can be used to predict

the target address [24]. To predict return addresses, a BTB may be used, but a return

address stack (RAS) proves to be considerably more accurate [20]. A BTB may vary in its

associativity, and it is indexed using the current PC address. If the tag matches, then the target

address is used. Otherwise, the next PC is used. The BTB requires a modest amount of

storage. Each entry needs to store the tag of the branch address and the full target address.

A new technique by Seznec, however, can dramatically reduce the storage requirement by

using a pointer to a page instead of a full page address [34].

Another innovative technique for instruction fetch prediction is the use of a next line

and set table (NLS) [7]. Calder observed that all that is immediately needed to access the
instruction cache is a line and set prediction. Therefore, instead of storing the complete

address like the BTB, an NLS entry records only the index of a line and a set prediction. It

also records 2-bit branch type information, which can represent an invalid entry, a return

instruction, a conditional branch, or any other type of branch. Figure 2.7 is a block diagram

of the NLS architecture. It is a decoupled branch architecture since it separates the branch

prediction (global two-level adaptive) from the target address prediction. The NLS is a

direct-mapped table, or can be stored in an instruction cache line. Since the NLS table

does not record a tag and has a small storage requirement for an instruction index, it can

store many times more entries than an associative BTB. Given a cost versus performance

comparison, an NLS architecture can be more effective than a BTB architecture [8].
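The storage advantage can be made concrete with rough per-entry bit counts. The address width, tag width, and cache geometry below are illustrative assumptions, not figures from [7] or [8]:

```python
def btb_entry_bits(addr_bits=32, tag_bits=20):
    # A BTB entry stores a branch-address tag plus the full target address.
    return tag_bits + addr_bits

def nls_entry_bits(lines=1024, ways=2, type_bits=2):
    # An NLS entry stores only a line index, a set (way) prediction,
    # and 2-bit branch type information -- no tag at all.
    line_bits = (lines - 1).bit_length()
    way_bits = max(1, (ways - 1).bit_length())
    return line_bits + way_bits + type_bits

print(btb_entry_bits())   # 52 bits per BTB entry
print(nls_entry_bits())   # 13 bits per NLS entry
```

Under these assumptions an NLS entry is about a quarter the size of a BTB entry, which is why the same budget buys many times more NLS entries.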

[Figure 2.7: Block Diagram Schematic of the NLS Architecture]


2.5 Multiple Block Fetching

In order to achieve a high fetching rate, multiple branches must be predicted in a

single cycle. In addition, in order to fetch beyond control transfer limitations, multiple

blocks need to be fetched per cycle. A basic block is defined to be instructions between

branches, whether they are taken or not taken. This dissertation refers to a block simply as

a group of sequential instructions up to a predefined limit, n, or up to the end of a line. A

line of instructions refers to the group of instructions physically accessed in the instruction

cache. The size of a line may be greater than or equal to the block width n. Therefore,

mechanisms which predict multiple basic blocks may be predicting multiple branches in a

single cache line or across multiple cache lines.

A technique to predict multiple basic blocks was first introduced by Yeh and Patt [54].

Multiple branches can be accurately predicted using their global two-level adaptive branch

prediction. This is accomplished by indexing the PHT with the k-bit GHR. After the GHR

is shifted to the left once, the two remaining possibilities are simultaneously accessed, and

the prediction for the second branch is selected based on the result of the first prediction.

This mechanism is shown in Figure 2.8. The number of simultaneous accesses to the

PHT increases exponentially with the number of branch predictions. In order to predict

multiple target addresses and select among the possibilities, Yeh and Patt introduced a

branch address cache (BAC). Given the current PC, the address of each possible successor

basic block is stored in a BAC entry. As shown in Figure 2.9, this is a tree structure which

grows exponentially with the number of basic block predictions. As a result, the BAC

requires an enormous amount of storage for accurate prediction, and a large percentage is

wasted from paths that are not used.
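The exponential growth of the BAC tree is easy to quantify: each branch prediction doubles the number of possible paths, and every intermediate target along the tree must be stored. A sketch of the count (the two-way branching assumption matches the tree of Figure 2.9):

```python
def bac_targets(depth):
    # A BAC entry must hold the addresses of every possible successor
    # basic block: 2 after one branch, 4 after two, ..., 2^d after d,
    # i.e. 2 + 4 + ... + 2^d addresses in the whole tree.
    return sum(2 ** d for d in range(1, depth + 1))

print([bac_targets(d) for d in (1, 2, 3, 4)])  # [2, 6, 14, 30]
```

Only one root-to-leaf path is ever used per fetch, so for a depth-4 entry at most 4 of the 30 stored addresses are useful; the rest is the wasted storage noted above.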

[Figure 2.8: Multiple Global Adaptive Branch Prediction]

Seznec et al. [35] recently introduced an innovative way to fetch multiple (two) basic

blocks. Their idea is to always use the current instruction block information to predict the

block following the next instruction block, as shown in Figure 2.10. Its accuracy is almost

as good as a single block fetching and requires little additional storage cost. The major

drawback, as the authors explain, is that the prediction for the second block is dependent

on the prediction from the first block (the tag-matching is serialized). Chapter 5 introduces

a mechanism which is able to predict multiple blocks in parallel without such a dependency.

Another mechanism to effectively fetch multiple basic blocks is a trace buffer [50] or

a trace cache [33]. A trace cache dynamically builds a run of instructions based on a starting

address. If the current address and branch prediction match an entry in the trace cache, then

the instructions stored in the cache are used. Otherwise, a single block of instructions must

be retrieved from the instruction cache, and a new entry is built. Using a 4KB trace cache,

[Figure 2.9: Branch Address Tree and Cache Mapping]

[Figure 2.10: Two-block Ahead Branch Prediction]


the instruction fetching rate was significantly improved, even with a 30-40% trace cache

hit rate.

2.6 Register Renaming

Register renaming is an important tool that allows a superscalar processor to perform out-of-order

execution. One method to rename registers is to use a reorder buffer [36, 19]. It can dy-

namically rename a logical (programmable) register to a unique tag identifier. The result is

written back into this buffer. If there are no exceptions or bad branches, the result eventu-

ally commits by writing the result to the register file.

Another register renaming technique maps logical registers into physical registers

(the index into the physical data array), as performed by the MIPS R10000 [52]. A block

diagram of the renaming process is shown in Figure 2.11. When an instruction is decoded,

a new physical register from a free list is allocated for its destination register and entered

into a mapping table. The old physical register for that register is entered into a recovery

list. The recovery list (also called the active list) maintains the in-order state of instructions

and can be used to undo the mappings in the event of a mispredicted branch or exception.

After an instruction completes and all previous instructions have completed, its register

is committed and the old value is discarded by freeing the old physical register contained

in the recovery list. When a source operand is decoded, its logical register number is
used as an index into the mapping table to read the corresponding physical register. The

advantage of this renaming technique is that one data array is used to store both committed

registers (part of the state of the machine) and speculative registers (extra registers reserved

for speculative results until committed).

[Figure 2.11: Block Diagram of Renaming Logic (free list, mapping table, and recovery list; a new physical register is allocated at rename, and the old physical register is freed at commit)]

2.6.1 Recovery

A major drawback with using a mapping table to rename logical registers to physical

registers is the large penalty required to recover from a branch misprediction or exception.

Recovery proceeds by reading the recovery list from the most recent entry until the mispre-

dicted branch or the instruction which caused an exception is encountered. After an entry is

read, the old physical register is used to replace the mapping of that entry’s logical register

number. After all appropriate entries from the recovery list have been read and re-mapped

into the mapping table, the mapping table will reflect the state of the machine after that

mispredicted branch.
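The rename, commit, and recovery-list walk described above can be sketched as follows. This is a simplified single-thread model with illustrative sizes; real hardware implements these structures as tables with checkpoints, not lists:

```python
class Renamer:
    """Mapping-table register renaming with a recovery (active) list."""
    def __init__(self, logical=4, physical=8):
        self.map = list(range(logical))              # logical -> physical
        self.free = list(range(logical, physical))   # free physical registers
        self.recovery = []                           # (logical, old physical)

    def rename_dest(self, logical):
        new = self.free.pop(0)                       # allocate from free list
        self.recovery.append((logical, self.map[logical]))
        self.map[logical] = new
        return new

    def commit_oldest(self):
        _, old = self.recovery.pop(0)                # in-order commit
        self.free.append(old)                        # discard the old value

    def recover(self, keep):
        # Undo all mappings newer than `keep` entries (e.g. those after a
        # mispredicted branch), walking the recovery list newest-first.
        while len(self.recovery) > keep:
            logical, old = self.recovery.pop()
            self.free.insert(0, self.map[logical])   # reclaim speculative reg
            self.map[logical] = old                  # restore old mapping

r = Renamer()
r.rename_dest(1); r.rename_dest(2)   # two speculative destination writes
r.recover(keep=0)                    # misprediction: roll both back
print(r.map)  # [0, 1, 2, 3] -- the original mappings are restored
```

The recovery loop makes the cost visible: the walk takes time proportional to the number of speculative instructions squashed, which is exactly the penalty the hybrid scheme of Chapter 6 targets.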

The large penalty required to recover the mapping table can be minimized by using

a checkpoint mechanism. The R10000 uses checkpointing for up to four branches, but not


for exceptions [52]. In order for checkpointing to be realistic in hardware, checkpoint stor-

age must be integrated into the basic cell of the mapping table to have direct access for a

single cycle recovery. Thus, a mapping table’s cell size is greatly increased and is not scal-

able with the number of branch checkpoints. With increased speculation from a wide-issue

superscalar and larger mapping tables from SMT, using standard RAM cells becomes imperative
for speed and scalability. Chapter 6 will introduce a new hybrid register renaming

technique which significantly reduces this penalty yet still retains some of the benefits of a

mapping table.

2.7 Register File Complexity

The design of a register file for a superscalar processor has become an increasingly

difficult task to accomplish. With increasing issue rates, larger instruction windows, deeper

pipelines, and the advent of simultaneous multithreading, the pressure on the register file

to supply multiple values and store speculative results has dramatically increased. In order

to issue N instructions per cycle, 2N registers need to be read for operands and N results

need to be written back. For instance, issuing eight instructions per cycle requires 16 read

ports and 8 write ports on a register file.

Unfortunately, the area of the register file increases proportional to the square of the

number of ports on a register file [10, 9]. Because of the implementation of a data cell in

a register file, each time a port is added, more hardware is required: a transistor, a wire to

access the cell, and a wire to read/write the cell. This increases both the cell’s length and

the cell’s width.
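These relationships can be sketched with two small helper functions; the function names are hypothetical, and the quadratic area model follows the cited result [10, 9].

```python
def register_file_ports(issue_width):
    """Ports needed to sustain a given issue width: two source
    operands per instruction and one result write-back."""
    read_ports = 2 * issue_width
    write_ports = issue_width
    return read_ports, write_ports

def relative_cell_area(ports, base_ports):
    """Area of a register file cell relative to a baseline,
    assuming area grows with the square of the port count."""
    return (ports / base_ports) ** 2
```

For example, doubling the issue width from 4 to 8 doubles the total port count (12 to 24) and roughly quadruples the cell area.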


In addition, the access time increases with the number of ports. For example, using

0.5μm CMOS technology, the cycle times of a register file with 64 registers for N =

4, 8, 12, 16 are estimated to be 2.8ns, 3.0ns, 3.2ns, 3.6ns, respectively [51]. The increased

cycle time must be taken into account in comparing the effective performance between

different values of N , as will be shown in Section 6.5 of Chapter 6.


Chapter 3

Experimental Methodology

This chapter presents the experimental methodology applied to the remainder of this

dissertation. The experiments presented are simulation based. A benchmark suite of pro-

grams, SPEC95, is executed via simulation, and results are gathered. The results are pre-

sented using metrics described in Section 3.4.

The SPEC95 programs are the industry standard for evaluating the performance of

computer systems. The programs are large in size, and they execute a large number of

dynamic instructions. These factors enable a realistic performance evaluation of a super-

scalar processor using the scalable hardware mechanisms presented in this dissertation.

The SPEC95 benchmark suite consists of two parts: an integer suite, SPECint95, and a

floating point suite, SPECfp95. A description and attributes of these benchmark programs

are given in Section 3.2. The SPECint95 and SPECfp95 suites were executed using the

SPARC instruction-set architecture [49]. In addition, Chapter 6 executes the SPECint95

using the Superscalar Digital Signal Processor (SDSP) instruction-set architecture [43].

The instruction set of the SDSP is very similar to MIPS [21]. Tools for both architectures

were used to facilitate simulation. These tools are described in Section 3.1.


3.1 Simulation Tools

A simulator was developed to analyze the performance of the different instruction

fetching and register file mechanisms. This was accomplished by using a front-end to pro-

vide a detailed dynamic instruction trace and a back-end to provide the simulation of the

specific mechanism. The front-end for the SPARC architecture used the Shade instruction-

set simulator [12]. The front-end for the SDSP architecture used the SDSP simulator de-

veloped in [42]. Both front-ends provide user-level traces only. The instruction trace was

generated during simulation run-time by actually executing the instructions. This provides

greater speed and flexibility than using a trace file.

After the program is compiled, the simulator’s front-end reads the program into its

memory and begins execution. The trace information is delivered to the back-end which

collects statistical information based on the machine model. After execution completes, the

statistical information is written to a file in its raw data format. Another program uses this

to display the statistical results, possibly from several different runs.

The SDSP front-end interprets SDSP code, while Shade uses dynamic compilation

to execute the program. When Shade compiles SPARC code, it also annotates it to record

specific run-time information. Additional decoding of each instruction is performed before

it is passed on to the back-end. As a result, a completed trace structure for an instruction

contains the original PC, the opcode, source register identifiers, destination identifier, and

the functional unit type.

The back-end of the simulator for full execution reads the stream of trace structures.

It performs its own decoding by building up a dependency list between the source operands

and the previous instructions in its instruction window. It also performs renaming of operands.


Instructions in its instruction window are scanned to find ready-to-run instructions. If the

required resource is available, then the instruction is marked issued. After the required latency

of an instruction, the simulator frees the appropriate resource and marks the instruction as

completed. It can simulate different branch prediction methods and handle mispredicted

branches. It can also simulate instruction and data cache misses. The simulator can execute

with different parameters: decode size, instruction window size, different register renaming

options, maximum issue rate, maximum result write rate, line size, instruction cache size,

data cache size, BTB size, PHT size, various penalty options, functional unit specifications

including number and latency, and multiple data cache banks with outstanding request

queue.

3.2 SPEC95 Benchmarks

Each SPEC95 program was compiled using the SunPro compiler with standard op-

timizations (-O) for the SPARC architecture and compiled using the GNU CC compiler

with second-level optimizations (-O2) for the SDSP architecture. A list of each program,

application area, and description is given in Table 3.1.

3.2.1 Program Attributes

Table 3.2 lists the branch attributes for the first billion instructions of each SPEC95

application on the SPARC architecture. The first column lists the percentage of dynamic

instructions that transferred control, which includes taken conditional branches and any other

type of branch. The second column lists the percentage of any type of branch encountered.

The third column lists the percentage of taken conditional branches. The remaining five


Program    Application Area         Description

SPECint95
go         Artificial intelligence  Plays the game of "Go" against itself.
m88ksim    CPU simulator            Motorola 88000 chip simulator; runs test program.
gcc        Compiler                 A benchmark version of the GNU C compiler, version
                                    2.5.3. Only the "cc1" phase is executed, using
                                    pre-processed files.
compress   Utility                  A compression program that uses adaptive Lempel-Ziv
                                    coding. It compresses and decompresses in-memory
                                    text data.
li         Interpreter              A LISP interpreter.
ijpeg      Graphics                 JPEG compression and decompression.
perl       Interpreter              Manipulates strings (anagrams) and prime numbers
                                    in Perl.
vortex     Database                 Single-user object-oriented database transaction
                                    benchmark.

SPECfp95
tomcatv    Geometry                 Generates 2-dimensional, boundary-fitted coordinate
                                    systems around general geometric domains.
swim       Meteorology              Solves the system of shallow water equations using
                                    finite difference approximations.
su2cor     Quantum physics          Calculates masses of elementary particles in the
                                    framework of the Quark Gluon theory.
hydro2d    Astrophysics             Uses hydrodynamic Navier Stokes equations to
                                    calculate galactical jets.
mgrid      Electromagnetism         A simplified multigrid solver computing a 3D
                                    potential field.
applu      Mathematics              Solves multiple, independent systems of a block
                                    tridiagonal system using Gaussian elimination
                                    (without pivoting).
turb3d     Aeronautics              Simulates isotropic, homogeneous turbulence in a
                                    cube with periodic boundary conditions.
apsi       Meteorology              Solves for the mesoscale and synoptic variations of
                                    potential temperature, the mesoscale vertical
                                    velocity and pressure, and distribution of pollutants.
fpppp      Quantum chemistry        Calculates multi-electron integral derivatives.
wave5      Electromagnetism         Solves particle Maxwell's Equations on a Cartesian
                                    mesh.

Table 3.1: Description of SPEC95 Applications


columns are a distribution of branch types. The branch types are conditional branches

(CBR), immediate and indirect unconditional branches (UBr), call instruction (Call), and

return instruction (Ret).

3.3 Machine Model

To verify that the new register file architecture of Chapter 6 performs well, a reasonable

machine model was chosen that resembles commercial processors including the PowerPC

604 [38], MIPS R10000 [52], and Sun UltraSPARC [39]. Table 3.3 lists the quantity,

type, and latency of the different function units modeled. The quantity of functional units

for the 8-way superscalar architecture is twice that of the 4-way superscalar architecture.

The machine model parameters used in simulation are:

- instruction cache: 64 Kbyte, two-way set associative LRU, 16 byte line size, 2 banks, self-aligned fetching, 10 cycle miss penalty

- data cache: 64 Kbyte, two-way set associative LRU, 16 byte line size, 4 banks, 2 simultaneous accesses per cycle, lockup-free, write-around, write-through, 4 outstanding cache request capability, 10 cycle miss penalty

- branch prediction: 2K x 2-bit pattern history table indexed by the exclusive-or of the PC and global history register

- speculative execution: enabled

- interrupts: precise

- instruction window: centralized; 32 entries for 4-way, 64 entries for 8-way


            %Control             %CBr     Branch Type Distribution
Program     Transfer  %Branches  Taken   %CBr  %UBr  %Call  %Ret

gcc            13        21        48      76    10     7     7
compress        9        17        33      70    16     7     7
go              9        14        47      75    11     7     7
ijpeg           5         9        41      76    14     5     5
li             14        22        42      62    12    13    13
m88ksim        10        16        50      72     8    10    10
perl           13        19        50      65    16    10     9
vortex         12        18        54      76     4    10    10
applu           4         6        56      86    14     0     0
apsi            2         3        51      83    11     3     3
fpppp           1         2        60      79    14     4     4
hydro2d         9        12        72      90     6     2     2
mgrid           1         1        61      81     5     7     7
su2cor         12        20        47      70    15     7     7
swim            2         2        69      69     4    13    13
tomcatv        12        19        46      70    15     8     8
turb3d          7         8        68      64    16    10    10
wave5           7         8        63      50    20    15    15

Table 3.2: Branch Attributes of SPEC95 Applications


Table 3.3: Functional Unit Quantity, Type, and Latency

   Quantity
4-way   8-way   Type               Latency

  4       8     ALU                   1
  2       4     Load unit             1
  2       4     Store unit            -
  1       2     Integer multiply      2
  1       2     Integer divide       10
  4       8     FP add                3
  1       2     FP multiply           3
  1       2     FP divide            16
  1       2     FP other              3

- register file: separate general purpose and floating point register files; logical registers are mapped to physical registers

- recovery list: 32 entries for 4-way, 64 entries for 8-way

- store buffer: 16 entries

The instruction scheduling logic uses a single instruction window for all functional

units [19]. A reasonable size for the instruction window, 32 entries for 4-way and 64 entries

for 8-way, was chosen that would give good performance and produce a strong demand for

registers during issue and write back. A variable number of instructions, up to the de-

code width of 4 or 8, may be inserted into the instruction window, if entries are available.

Instructions are issued out-of-order using an oldest first algorithm. Store instructions are


issued in-order, and load instructions may be issued out-of-order in between store instruc-

tions.

Each cycle, a variable number of registers, up to the decode width, may be retired

from the recovery list. If entries are available, then they may be used by new instructions,

up to the decode width. The old physical register corresponding to the same destination

register is inserted into the recovery list.

The pipeline stages of the processor modeled are instruction fetch, decode and re-

name, issue, register read, execute, and result commit, as shown in Figure 2.1. Conse-

quently, two levels of bypassing are required for back-to-back execution of dependent in-

structions. Also, instructions dependent on a load are optimistically issued, in expectation

of a cache hit. If a cache miss occurs, then the dependent instructions must be re-issued,

similar to the design in [48]. In addition, the simulator continues to fetch instructions down

the wrong path until an incorrectly predicted branch is resolved.

3.3.1 Default Configuration

In Chapter 4, the instruction fetching mechanisms are simulated. Therefore, no

branch prediction or execution was simulated. A perfect instruction cache was assumed,

and only the type of cache was considered. Each program was simulated for the first four

billion instructions.

In Chapter 5, branch prediction and instruction fetch prediction were simulated. The

branch prediction was always based on a PHT indexed by the exclusive-or of the GHR and

block address. A perfect instruction cache was again assumed, and only the type of cache


and number of banks were considered. Each program was simulated for the first one billion

instructions.

Full execution was simulated in Chapter 6. Unless otherwise noted, the

base configuration of the simulator uses the defaults described in the section above. Only

the first fifty million instructions were simulated for full execution.

3.4 Performance Metrics

This section covers the performance metrics used in this dissertation. Performance

metrics are used to give an overall impression of the performance of a specific hardware

mechanism. Depending on the performance objective, one performance metric may be

better suited than another. In Chapter 4, the basic instruction fetching performance is ana-

lyzed by using the instructions fetched per cycle (IFPC) metric. In Chapter 5, the Branch

Execution Penalty (BEP) metric is used to indicate the penalty cost for executing multi-

ple branches and blocks, and the Effective Instruction Fetching Rate (IPC_f) metric gives

an overall fetching rate including all branch penalties. Finally, in Chapter 6, the instruc-

tions per cycle (IPC) metric is used to show the overall performance including fetching and

execution.

3.4.1 Instructions Fetched Per Cycle (IFPC)

The instructions fetched per cycle represent the average number of instructions re-

turned to the decoder per fetch cycle. This would equal IPC if there were no branch

mispredictions, cache misses, or other stalls in execution. The IFPC represents the raw


fetching rate the instruction fetch mechanism can deliver assuming a perfect instruction

cache, branch prediction, and instruction fetch prediction.

3.4.2 Branch Execution Penalty (BEP)

The branch execution penalty is defined to be

BEP = Total Branch Penalty Cycles / Branches    (3.1)

This gives us the average number of additional cycles required to execute a branch

instruction. The total branch penalty cycles include cycles from a branch misprediction,

misselection, a branch misfetch, an indirect branch misprediction, and a return address

misprediction (see Table 5.3 in Chapter 5). In addition, when fewer than the maximum

number of blocks are fetched using multiple block prediction, the additional cycles required

to fetch remaining blocks are considered part of the branch penalty. This includes bank

conflicts.
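Equation 3.1 amounts to a one-line computation; this Python sketch is purely illustrative, with a hypothetical function name:

```python
def branch_execution_penalty(total_branch_penalty_cycles, branches):
    """Equation 3.1: average number of additional cycles required
    to execute a branch instruction."""
    return total_branch_penalty_cycles / branches
```

For example, 500 penalty cycles accumulated over 1000 executed branches gives a BEP of 0.5 cycles per branch.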

3.4.3 Effective Instruction Fetch Rate (IPC_f)

The effective instruction fetch rate is similar to IFPC, except now branch prediction

and instruction fetch prediction are taken into consideration. The rest of the processor

execution is assumed to be ideal. The effective instruction fetch rate is computed as,

IPC_f = Valid instructions / Fetch cycles    (3.2)

where the number of fetch cycles is equal to

Fetch cycles = Total Branch Penalty Cycles + (Blocks fetched / Maximum blocks per cycle)    (3.3)


The number of blocks fetched refers to the total number of valid blocks fetched and delivered

to the decoder. When fetching multiple blocks per cycle, the maximum blocks per cycle

refers to the number of blocks the fetcher is designed to fetch.

3.4.4 Instructions Per Cycle (IPC)

The instructions per cycle is the total number of dynamic instructions executed di-

vided by the total execution cycles. Ideally, the IPC would be equal to N , the maximum

number of instructions a superscalar processor can issue in a cycle. Due to instruction

fetching, cache misses, data dependencies, and mispredictions, IPC performance is usually

much lower than the ideal rate of N .


Chapter 4

Instruction Fetching Mechanisms

Instruction fetch mechanisms involve the process of how instructions are fetched

from memory and delivered to the decoder. Since this chapter focuses only on hardware in-

struction fetching mechanisms, other performance issues (such as branch prediction, cache,

execution, etc.) are not evaluated. The objective of this chapter is to describe, evaluate, and

provide solutions to the first step in a series of hurdles for exploiting high levels of ILP.*

First, the fetching model used throughout this chapter is described in Section 4.1.

Next, different hardware techniques used for instruction fetching are described in Sec-

tion 4.2. A mathematical model for each hardware technique is presented in Section 4.3.

Finally, Section 4.4 compares the expected instruction fetching performance with results

from simulating the SPEC95 benchmark suite.

4.1 Fetching Model

This section describes the fetching model used in the rest of the chapter. The cache

line size is defined to be the size of a row in the instruction cache. The terms ‘line’ and

‘row’ are used interchangeably. This determines the maximum number of instructions that

*Portions of this chapter were published in Euro-Par '96 [44].


can be accessed simultaneously in one cycle. Also, a block is defined to be a group of

sequential instructions. A block’s width is the maximum number of instructions allowable.

Figure 4.1 is a block diagram showing the different fetching steps. The instruction

cache reads the requested fetch block of width q and returns it to the instruction fetcher.

The instruction decoder receives a decode block of width n. If prefetching is applied, up to

q new instructions from the instruction fetcher go into the prefetch buffer FIFO queue and

n instructions come out. This implies q ≥ n in the diagram. Otherwise, if prefetching is not

used, the fetch and decode widths are equal, and the instruction fetcher delivers instructions

directly to the decoder.

[Figure: the PC indexes the I-cache, which returns a fetch block of width q to the instruction fetcher; the fetcher feeds a FIFO prefetch buffer, which delivers a decode block of width n to the instruction decoder.]

Figure 4.1: Fetching Block Diagram

The instruction fetcher is responsible for determining the new starting PC each cycle

and sending it to the instruction cache. It cooperates with a branch predictor or branch

target buffer, if employed. Calder and Grunwald [5] describe different techniques for fast

PC calculation. Whichever technique is used, the new PC must be determined in the same

cycle. Also, after the instruction fetcher receives the fetch block from the instruction cache,


it performs preliminary decoding to determine the instruction type (or uses prediction/pre-

decoding methods). Instructions after the first instruction that transfers control are invali-

dated.

Johnson defines an instruction run to be the sequentially fetched instructions between

branches [19]. In this dissertation, an instruction run is further specified to be between in-

structions that transfer control. A control transfer instruction includes unconditional jumps

and calls, conditional branches that are taken, and any other instruction that transfers con-

trol, such as a trap. The run length is the number of instructions in a run. In addition, a

block run is defined to be the instructions from the start of the block to the end of the

block or the first instruction that transfers control. The block run length is the number of

instructions in a block run.
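The block run length definition can be sketched directly; the `transfers_control` predicate is a hypothetical stand-in for the fetcher's preliminary decoding of instruction types:

```python
def block_run_length(block, transfers_control):
    """Number of instructions from the start of the block up to and
    including the first control-transfer instruction, or the whole
    block if none transfers control."""
    for i, inst in enumerate(block):
        if transfers_control(inst):
            return i + 1
    return len(block)
```

A block whose second instruction is a taken branch has a block run length of two, regardless of the block's width.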

4.2 Hardware Techniques

This section describes hardware techniques which perform instruction fetching. To

begin with, three cache types are described: a simple cache, an extended cache, and a

self-aligned cache. Next, prefetching is described. Finally, a new mechanism to fetch two

blocks per cycle, a dual branch target buffer, is introduced.

4.2.1 Simple Cache

A straightforward approach to fetch instructions from the instruction cache is to have

the line size equal the width of the fetch block. If the starting PC address is not the first

position in the corresponding row of the instruction cache, then the appropriate instructions


are invalidated and fewer than the fetch width are returned. As with all fetching techniques,

if there is an instruction that transfers control, instructions after it are invalidated.

Figure 2.2 showed an example for the simple fetching mechanism. In that example,

the second instruction in the first block was a taken branch, so the third and fourth in-

structions were invalidated. Also, only two instructions from the second block were valid.

Altogether, only four out of a potential eight instructions were used for instruction decoding

and execution, which illustrates the problem with this simple approach.
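The simple cache's delivery of valid instructions can be sketched as follows; names are illustrative, and the predicate again stands in for preliminary decoding:

```python
def simple_fetch(line, start_offset, transfers_control):
    """Instructions a simple cache delivers in one cycle: drop the
    instructions before the starting PC's position in the line, and
    everything after the first control transfer."""
    fetched = line[start_offset:]
    for i, inst in enumerate(fetched):
        if transfers_control(inst):
            return fetched[:i + 1]
    return fetched
```

With a four-instruction line whose second instruction is a taken branch, only two instructions reach the decoder, matching the Figure 2.2 example.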

4.2.2 Extended Cache

One way to reduce the chance that instructions will be lost from an unaligned target

address of a control transfer instruction is to extend the instruction cache line size beyond

the width of the fetch block. To avoid lost instructions on sequential reads that are not

block aligned, the instruction fetcher must be able to save the last n � � instructions in a

row and combine them with instructions that are read the next cycle. Only when there is a

control transfer to the last n� � instructions in a cache row, instructions are lost due to an

unaligned target address.

Figure 4.2 is an example of the extended cache fetching technique using n = 4 and

an extended cache line size of 8 instructions. The starting PC in this example is at the

third instruction in Line 0. Four instructions are returned to the instruction fetcher in Cycle

1. The last two instructions in Line 0 are saved for the next cycle. During Cycle 2, the

instruction fetcher combines two new instructions read from Line 1 and the two instructions

saved the previous cycle. There is no need to save any instructions this cycle because the

line can be re-read and still be able to return four instructions.


[Figure: two extended cache lines of eight instructions each. In Cycle 1, four instructions starting at the third position of Line 0 are fetched and the last two instructions of Line 0 are saved. In Cycle 2, the two saved instructions are combined with two new instructions read from Line 1.]

Figure 4.2: Extended Fetching Example

4.2.3 Self-Aligned Cache

The target alignment problem can be solved completely in hardware with a self-

aligned instruction cache. The instruction cache reads and concatenates two consecutive

rows within one cycle so as to always be able to return n instructions. To implement a

self-aligned cache, the hardware must either use a dual-port instruction cache, perform two

separate cache accesses in a single cycle, or split the instruction cache into two banks.

Using a two-way interleaved (i.e., two banks) instruction cache is preferred for both space

and timing reasons [13, 17, 14].

Figure 4.3 is an example of the self-aligned cache fetching technique using n = 4.

Only the last two instructions in Line 0 are available for use because the starting PC is not

at the first position. Since the following line is read and available during the same cycle,

four instructions are returned by combining the two instructions from Line 0 and the first

two instructions from Line 1.
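The self-aligned scheme can be sketched by concatenating two consecutive rows before extracting the fetch block; the function name and trace representation are illustrative:

```python
def self_aligned_fetch(cache_lines, start_addr, line_size, n,
                       transfers_control):
    """Read two consecutive rows and return up to n instructions
    starting at start_addr, so an unaligned target loses nothing."""
    row, offset = divmod(start_addr, line_size)
    # The two banks supply the indexed row and its successor.
    window = cache_lines[row] + cache_lines[row + 1]
    fetched = window[offset:offset + n]
    for i, inst in enumerate(fetched):
        if transfers_control(inst):
            return fetched[:i + 1]
    return fetched
```

Starting two instructions into a four-wide line still yields a full block of four, as in the Figure 4.3 example.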


[Figure: two consecutive cache lines are read in the same cycle; the last two instructions of Line 0 are combined with the first two instructions of Line 1 to return a full fetch block of four instructions.]

Figure 4.3: Self-aligned Fetching Example

4.2.4 Prefetching

All of the above cache types can be used in conjunction with prefetching. Prefetching

helps improve fetching performance, but fetching is still limited because instructions after

a control transfer must be invalidated.

The fetch width q, where q ≥ n, is the number of instructions that are examined for a con-

trol transfer. Let p be the size of the prefetch buffer. After the instruction fetcher searches

up to q instructions for a control transfer, valid instructions are stored into a prefetch buf-

fer. Each cycle, the instruction decoder removes the oldest n instructions from the prefetch

buffer. In essence, the prefetch buffer enables an average performance closer to the larger

expected run length of q instructions compared to n instructions.
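One cycle of this scheme can be sketched with a FIFO; the `prefetch_cycle` name and the list-of-instructions trace format are assumptions for illustration:

```python
from collections import deque

def prefetch_cycle(buffer, fetched_valid, n):
    """One cycle of prefetching: append the valid instructions from
    this cycle's fetch block, then hand the oldest n instructions
    (FIFO order) to the decoder."""
    buffer.extend(fetched_valid)
    return [buffer.popleft() for _ in range(min(n, len(buffer)))]
```

Replaying the Figure 4.4 example with n = 4: seven valid instructions arrive in Cycle 1, four are decoded and three buffered; two more arrive in Cycle 2, so three buffered instructions plus one new one are decoded and the last new instruction stays buffered.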

Figure 4.4 shows an example using prefetching with n = 4, q = 8, and p = 4.

Starting with an empty prefetch buffer, there are seven valid instructions (this example

shows a complete block of q = 8 instructions returned by the instruction cache to the

instruction fetcher) before branch. Four instructions are used in this cycle, while the re-

maining three valid instructions are put in the prefetch buffer for later use. The next cycle,

a block of instructions starting with the target address of the branch is read. Only two in-

structions are valid because a call instruction was detected. As a result, three instructions


from the buffer and the first add instruction are used, while the remaining call instruction

is put into the prefetch buffer.

[Figure: in Cycle 1, seven of the eight instructions in the fetch block are valid up to the branch; four go to the decoder and three enter the prefetch buffer. In Cycle 2, only two target-path instructions are valid because of a call; three buffered instructions and the first new instruction are decoded, and the call is placed in the prefetch buffer.]

Figure 4.4: Prefetch Example

4.2.5 Dual Branch Target Buffer

In this section the dual branch target buffer (DBTB) is introduced. It is based on the

original branch target buffer (BTB) design by Lee and Smith [24]. Unlike the previous

techniques mentioned thus far, the DBTB can bypass the limitation imposed by a control

transfer. The DBTB is similar to the Branch Address Cache introduced by Yeh, et al. [54],

except the DBTB does not grow exponentially. Conte et al. introduced the collapsing

buffer, which allows intra-block branches [13]. The DBTB can handle both intra-block and

inter-block branches.

The purpose of a BTB is to predict the target address of the next instruction given

the address of the current instruction. This idea is taken one step further. Given the current

PC, the DBTB predicts the starting address of the following two lines. Using the predicted

addresses for the next two lines, a dual-ported instruction cache is used to simultaneously


read them. Hence, the first line may have a control transfer without requiring another cycle

to fetch the subsequent line.

The DBTB is indexed by the starting address of the last row currently being accessed

in the instruction cache (i.e., the current PC). The entry read from the DBTB can be viewed

as two BTB entries, BTB1 and BTB2. The DBTB entry indexed may match both in BTB1

and BTB2, in one or the other, or none at all. This allows a single DBTB entry to be shared

between two different source PCs. Although physically they are one entry, logically they

are separate.

Figure 4.5 is a block diagram of a DBTB entry and shows how it is used in determining the following two rows' PC starting addresses, PC1 and PC2. The tag of the current PC is checked against the PC tag found in BTB1. If it matches, then the predicted PC1 found in BTB1 is used. Otherwise, the prediction is to follow through to the next row of the instruction cache. If the value predicted for PC1 matches the tagged value in BTB2, then the prediction for PC2 in BTB2 is used; else, PC2 is predicted to be the next row after PC1. The exit position in a DBTB entry indicates where the control transfer (or follow through) is predicted to occur. The DBTB entry also contains branch prediction information about all the potential branches in the referenced line. It may contain no information at all, a 1-bit predictor, a 2-bit saturating predictor, or information for other branch prediction mechanisms.
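The two-level lookup just described can be sketched as follows. This is an illustrative model only, not the dissertation's hardware: the field names, the 16-byte line size, and the tag function are assumptions.

```python
from dataclasses import dataclass

LINE_BYTES = 16  # assumed line size (4 instructions x 4 bytes)

def tag_of(pc: int) -> int:
    return pc >> 4  # drop the line offset (assumed 16-byte lines)

@dataclass
class BTBHalf:
    pc_tag: int     # tag of the source PC this half predicts for
    target: int     # predicted starting address of the following line
    exit_pos: int   # position where the transfer (or follow through) occurs
    pred_info: int  # per-line branch prediction bits

def dbtb_lookup(btb1: BTBHalf, btb2: BTBHalf, current_pc: int):
    """Predict PC1 and PC2, the starting addresses of the next two lines."""
    # BTB1 is tagged with the current PC; a miss predicts fall-through.
    if btb1.pc_tag == tag_of(current_pc):
        pc1 = btb1.target
    else:
        pc1 = current_pc + LINE_BYTES
    # BTB2 is tagged with the predicted PC1; a miss falls through past PC1.
    if btb2.pc_tag == tag_of(pc1):
        pc2 = btb2.target
    else:
        pc2 = pc1 + LINE_BYTES
    return pc1, pc2

entry1 = BTBHalf(pc_tag=tag_of(0x100), target=0x200, exit_pos=1, pred_info=0)
entry2 = BTBHalf(pc_tag=tag_of(0x200), target=0x340, exit_pos=3, pred_info=0)
print([hex(a) for a in dbtb_lookup(entry1, entry2, 0x100)])  # ['0x200', '0x340']
```

Note how the serial tag check on the predicted PC1 appears explicitly in the second step; removing that comparison is the motivation for the single-tag alternative.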

To save space, an alternative design of the DBTB would logically unify BTB1 and BTB2. Only one PC source can be valid, so only one PC tag is stored. In addition to the space savings, the time it takes for PC2 to be ready is reduced because the predicted PC1 does not need to be checked against the tagged PC1 in BTB2. This improvement may be critical to a processor's cycle time. The drawback is that BTB2 must be invalidated to reflect a


follow-through prediction when BTB1 is updated, which can reduce prediction accuracy.

On the other hand, a BTB2 misprediction does not need to invalidate BTB1.

[Figure omitted in this extraction: a dual BTB entry. Each half (BTB1, BTB2) holds a PC tag, a predicted PC, an exit position, and branch prediction info. The current PC indexes the entry; tag comparators and multiplexers choose between each predicted PC and the fall-through address (the previous PC plus the line size) to produce PC1 and PC2.]

Figure 4.5: Block Diagram of Dual Branch Target Buffer Entry

The DBTB has many different configurations, many similar to the traditional BTB.

Its options include the number of entries, associativity, branch prediction, and a one or two

tagged system. A DBTB can be used with a simple, extended, or self-aligned cache, and

with or without prefetching. Figure 4.6 is a fetching example, without prefetching, using the

DBTB. In the previous cycle, BTB1 predicted PC1 to be at Address 0, and BTB2 predicted

Line 0 to exit at position 1 to PC2 at Address 12. While Line 0 and Line 3 are being read,

PC2 is used to index into the DBTB to predict the next PC1 and PC2. Although Line 0 has

a jump, a full fetch block of four instructions is returned.


[Figure omitted in this extraction: Line 0 (addresses 0-3) exits at position 1 via a jump to PC2 at address 12 in Line 3 (addresses 12-15); two instructions from Line 0 and two from Line 3 form a full four-instruction fetch block, and the remaining positions of each line are ignored.]

Figure 4.6: Dual Branch Target Buffer Example

4.3 Expected Instruction Fetch

A mathematical model for each type of fetching mechanism from the previous section is presented here. The model allows the expected instruction fetching performance to be calculated. In the next section, the expected performance from this model will be compared with results from simulation.

4.3.1 Simple Cache

Let L_i be the probability a control transfer occurs at position i, and E_i be the probability the starting address in the block is at position i. Upon a control transfer, if the target address is equally likely to enter any position in a block, then

E_0^{simple}(n,b) = 1 - \frac{n-1}{n}\, c^{simple}(n,b), \qquad E_i^{simple}(n,b) = \frac{c^{simple}(n,b)}{n}, \quad 1 \le i < n,   (4.1)

L_i^{simple}(n,b) = \sum_{j=0}^{i} b(1-b)^{i-j} E_j^{simple}(n,b),   (4.2)


where c(n,b) is the probability of a control transfer in a block,

c^{simple}(n,b) = \sum_{i=0}^{n-1} L_i^{simple}(n,b) = \frac{nb}{(n-1)b + 1}.   (4.3)

The total expected instructions fetched per cycle for simple fetching is

F^{simple}(n,b) = \sum_{i=0}^{n-1} E_i^{simple}(n,b)\, r(n-i,b) = \frac{c^{simple}(n,b)}{b} = \frac{n}{(n-1)b + 1}.   (4.4)

Equation 4.4 is the weighted sum of the expected number of instructions at each possible

starting position.
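These closed forms are easy to check numerically. The snippet below is my own sanity check, not part of the dissertation; it reproduces the n = 4 row of Table 4.1.

```python
def c_simple(n: int, b: float) -> float:
    # Probability of a control transfer in the fetched block (Eq. 4.3).
    return n * b / ((n - 1) * b + 1)

def f_simple(n: int, b: float) -> float:
    # Expected instructions fetched per cycle (Eq. 4.4): c_simple / b.
    return c_simple(n, b) / b

# n = 4, b = 1/8, as in Table 4.1:
print(round(c_simple(4, 0.125), 3), round(f_simple(4, 0.125), 2))  # 0.364 2.91
```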

4.3.2 Extended Cache

The probability the starting address in the block is at position i for the extended cache is

E_0^{extend}(n,b,m) = 1 - \frac{m-1}{m}\, c^{extend}(n,b,m), \qquad E_i^{extend}(n,b,m) = \frac{c^{extend}(n,b,m)}{m}, \quad 1 \le i < m.   (4.5)

The probability of a control transfer in a block for the extended cache, given the extended cache line size m, m \ge n, is

c^{extend}(n,b,m) = \frac{m-n}{m}\left(1-(1-b)^n\right) + \frac{n}{m}\cdot\frac{nb}{(n-1)b+1}.   (4.6)

The expected instructions fetched per cycle is

F^{extend}(n,b,m) = \frac{m-n}{m}\, r(n,b) + \frac{n}{m}\, F^{simple}(n,b) = \frac{m-n}{m}\cdot\frac{1-(1-b)^n}{b} + \frac{n}{m}\cdot\frac{n}{(n-1)b+1}.   (4.7)


With the cache line size extended beyond the desired n instructions, if there is a control

transfer, n out of m times it is expected to transfer into the last n instructions of the block,

which behave as the simple fetching case where less than n instructions are available. The

rest of the time n instructions will be available.

4.3.3 Self-aligned Cache

The probability of a control transfer in a block for the self-aligned cache is

c^{align}(n,b) = 1-(1-b)^n.   (4.8)

The expected instructions fetched per cycle for the self-aligned cache is the expected block run length of width n,

F^{align}(n,b) = r(n,b) = \frac{1-(1-b)^n}{b},   (4.9)

because n instructions will always be read from the instruction cache.
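As with the simple cache, the extended and self-aligned closed forms can be checked against Table 4.1. The snippet below is my own check, not part of the dissertation.

```python
def run_len(n: int, b: float) -> float:
    # Expected run length r(n, b) = (1 - (1 - b)^n) / b.
    return (1 - (1 - b) ** n) / b

def f_simple(n: int, b: float) -> float:
    return n / ((n - 1) * b + 1)          # Eq. 4.4

def f_extend(n: int, b: float, m: int) -> float:
    # Eq. 4.7: an (m - n)/m : n/m mix of the aligned-like and simple-like cases.
    return (m - n) / m * run_len(n, b) + n / m * f_simple(n, b)

def f_align(n: int, b: float) -> float:
    return run_len(n, b)                  # Eq. 4.9

# n = 4, b = 1/8, m = 2n, matching the corresponding Table 4.1 row:
print(round(f_simple(4, 0.125), 2),
      round(f_extend(4, 0.125, 8), 2),
      round(f_align(4, 0.125), 2))  # 2.91 3.11 3.31
```

With m = 2n the two mixing weights are both 1/2, which is why the extended value is exactly the average of the simple and aligned values in the table.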

4.3.4 Prefetching

All three cache techniques can be used in combination with prefetching. The fetch

and decode widths are not equal with prefetching. As a result, q, the fetch width, may now

be substituted for n, the decode width, as a parameter to some of the equations previously

defined that did not use prefetching, as will be indicated.

Let I_i^{type} be the probability exactly i instructions are available up to and including a control transfer instruction or the end of the block, where type is one of the three different cache types: simple, extend, or align. The equations for the three types are:

I_i^{simple} = (1-b)^{i-1} E_{q-i}^{simple}(q,b) + \sum_{j=0}^{q-i-1} b(1-b)^{i-1} E_j^{simple}(q,b),   (4.10)

I_i^{extend} = \begin{cases} (1-b)^{i-1} E_{m-i}^{extend}(q,b,m) + b(1-b)^{i-1} \sum_{j=0}^{m-i-1} E_j^{extend}(q,b,m), & 1 \le i \le q-1 \\ (1-b)^{q-1} \sum_{j=0}^{m-q} E_j^{extend}(q,b,m), & i = q \\ 0, & \text{otherwise} \end{cases}   (4.11)

I_i^{align} = \begin{cases} b(1-b)^{i-1}, & 1 \le i \le q-1 \\ (1-b)^{q-1}, & i = q \\ 0, & \text{otherwise} \end{cases}   (4.12)

Let P_i be the probability the prefetch buffer contains i instructions. Figure 4.7 illustrates the transition from one buffer state to another. It does not show all possible

transitions. The prefetch buffer increases in size when the number of new instructions

is greater than n. It will remain in the same state if exactly n new instructions are available.

It decreases in size when fewer than n new instructions are available. The zero and full

boundary states have additional possible transitions.

[Figure omitted in this extraction: prefetch buffer states P_0 through P_p with transition arcs labeled by the I_i probabilities, e.g., I_{n+1} grows the buffer, I_n holds it, I_{n-1} shrinks it, with summed arcs such as I_1 + ... + I_n at the empty state and I_n + ... + I_q at the full state.]

Figure 4.7: Prefetch Buffer State Diagram

The probability the prefetch buffer is in state i is

P_i^{type} = \begin{cases} \sum_{j+k \le n} P_j^{type} I_k^{type}, & i = 0 \\ \sum_{j+k = n+i} P_j^{type} I_k^{type}, & 0 < i < p \\ \sum_{j+k \ge n+p} P_j^{type} I_k^{type}, & i = p \end{cases} \qquad (0 \le j \le p,\; 1 \le k \le q).   (4.13)


Also, \sum_{i=0}^{p} P_i = 1. Equation 4.13 can be expanded as a system of linear equations and solved for each P_i.

The total expected instruction fetch for each of the different cache types with prefetching is

F_{prefetch}^{type}(p,q,n,b) = n - \sum_{i=0}^{n-1} (n-i) \sum_{\substack{j+k=i \\ 0 \le j \le p,\; 1 \le k \le q}} P_j^{type} I_k^{type}.   (4.14)

Notice Equation 4.14 depends only on the n - 1 smallest prefetch buffer states, since if there are n - 1 or more instructions in the prefetch buffer, n instructions are guaranteed for that cycle.

A problem can arise with prefetching and the simple cache type. The prefetch buffer

can be full, and instructions from the fetch block go unused. If this happens, the starting

address of the next cycle will not be the first position, so q instructions will not be available.

Therefore, Equation 4.1 needs to be modified to include this effect, unless a hardware

solution similar to that of the extended cache is included. The hardware would need to save

instructions left over on a prefetch buffer overflow for the following cycle. If this is done,

Equation 4.10 is an accurate model.
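Equations 4.12 through 4.14 can be evaluated mechanically. The sketch below, my own illustration rather than the dissertation's code, power-iterates the buffer state distribution of Equation 4.13 for the self-aligned case; with p = 0 and q = n it collapses to F^align, as expected.

```python
def i_align(q: int, b: float) -> dict:
    # I_i: probability exactly i instructions are available (Eq. 4.12).
    dist = {i: b * (1 - b) ** (i - 1) for i in range(1, q)}
    dist[q] = (1 - b) ** (q - 1)
    return dist

def f_prefetch(p: int, q: int, n: int, b: float, iters: int = 2000) -> float:
    I = i_align(q, b)
    P = [1.0 / (p + 1)] * (p + 1)        # initial guess for the P_i
    for _ in range(iters):               # power-iterate Eq. 4.13 to steady state
        nxt = [0.0] * (p + 1)
        for j in range(p + 1):
            for k, pk in I.items():
                # Next buffer state: j + k new minus n consumed, clamped to [0, p].
                nxt[min(max(j + k - n, 0), p)] += P[j] * pk
        P = nxt
    # Eq. 4.14: subtract the shortfall when fewer than n instructions arrive.
    short = sum((n - i) * P[j] * I[k]
                for i in range(n)
                for j in range(p + 1)
                for k in I if j + k == i)
    return n - short

# With no buffer (p = 0) and q = n this reduces to F_align(4, 1/8):
print(round(f_prefetch(0, 4, 4, 0.125), 2))  # 3.31
```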

4.3.5 Dual Block Fetching

Fetching two blocks per cycle (via the DBTB) with the simple, extended, or self-aligned cache without prefetching is simply twice the expected value for half the block size,

F^{dbtb\text{-}type}(n,b) = 2\, F^{type}\!\left(\frac{n}{2},\, b\right).   (4.15)


If prefetching is used with dual block fetching, the equation for I_k^{type} in Equation 4.13 and Equation 4.14 is replaced with

I_k^{dbtb\text{-}type}(q,b) = \sum_{j=0}^{k} I_j^{type}\!\left(\frac{q}{2},\, b\right) I_{k-j}^{type}\!\left(\frac{q}{2},\, b\right).   (4.16)
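The convolution in Equation 4.16 simply combines the contributions of two independent q/2-wide blocks. A minimal sketch (mine, shown for the self-aligned case) is:

```python
def i_align(q: int, b: float) -> dict:
    # Per-block availability distribution (Eq. 4.12).
    dist = {i: b * (1 - b) ** (i - 1) for i in range(1, q)}
    dist[q] = (1 - b) ** (q - 1)
    return dist

def i_dual(q: int, b: float) -> dict:
    # Eq. 4.16: convolve the distributions of two independent q/2-wide blocks.
    half = i_align(q // 2, b)
    dual = {}
    for j, pj in half.items():
        for k, pk in half.items():
            dual[j + k] = dual.get(j + k, 0.0) + pj * pk
    return dual

# The convolution is still a probability distribution:
dual = i_dual(16, 0.125)
print(round(sum(dual.values()), 6))  # 1.0
```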

4.3.6 Evaluation

Table 4.1 lists the evaluation of the simple, extended, and self-aligned cache types without prefetching for b = 1/8 and for different values of the decode block width n. The value chosen for b, the probability of a control transfer, is common for RISC architectures. The probability of a control transfer in a block is listed as well as the expected instructions fetched per cycle. For n = 64, the fetching rate is close to 1/b. Although this large fetching width achieves excellent fetching performance, it may not be practical to implement in hardware.

Table 4.1: Expected Instruction Fetch

(All values for b = 1/8; the extended cache uses m = 2n.)

n     c^simple  F^simple  c^extend  F^extend  c^align  F^align
1     .125      1.00      .125      1.00      .125     1.00
2     .222      1.78      .228      1.83      .234     1.88
4     .364      2.91      .389      3.11      .414     3.31
8     .533      4.26      .595      4.76      .656     5.25
16    .696      5.57      .789      6.32      .882     7.06
32    .821      6.56      .903      7.23      .986     7.89
64    .901      7.21      .951      7.61      1.00     8.00

Figure 4.8 shows the expected instruction fetch for the simple, extended, and self-aligned cases without prefetching for b = 1/8. Although ideally, for a block size of n, a fetching rate of n instructions per cycle is desired, the difference between this ideal and the actual rate increases as n increases. Instead, the rate approaches 1/b (8 in this instance) for each case. The disadvantage of the simple and extended cache techniques is the lower rate at which they approach this limit: it takes a significantly larger value of n to reach the same expected fetch performance. With the extended case of m = 2n, its value is the average of the align and simple values for each n.

[Figure omitted in this extraction: Expected Fetch vs. n for b = 1/8, with curves for the ideal, align, extend, and simple cases.]

Figure 4.8: Expected Instruction Fetch without Prefetching


[Figure omitted in this extraction: Expected Fetch vs. prefetch buffer size p for b = 1/8 and n = 4, with one curve per fetch width q = 4 through q = 8.]

Figure 4.9: Self-Aligned Expected Instruction Fetch with Prefetching (n = 4)

Figure 4.9 shows the expected instruction fetch for the self-aligned cache with prefetching for b = 1/8 and n = 4, varying p and q. In this case, it takes very little increase in q and p to come close to maximum fetching: with only a modest fetch width and prefetch buffer, the expected fetch is already over 3.95.


[Figure omitted in this extraction: Expected Fetch vs. prefetch buffer size p for b = 1/8 and n = 8, with one curve per fetch width q = 8 through q = 16.]

Figure 4.10: Self-Aligned Expected Instruction Fetch with Prefetching (n = 8)

Figure 4.10 shows the expected instruction fetch for the self-aligned cache with prefetching for b = 1/8 and n = 8, varying p and q. The curves for the different values of q are identical for p \le q - n. After that point, each curve branches out and approaches its r(q, b) limit. To reach the ultimate limit of 1/b, both q and p need to increase.


[Figure omitted in this extraction: Expected Fetch vs. prefetch buffer size p for b = 1/8 and n = 8, with one curve per fetch width q = 8 through q = 16.]

Figure 4.11: Simple Expected Instruction Fetch with Prefetching

Figure 4.11 shows the expected instruction fetch for the simple cache with prefetching for b = 1/8 and n = 8, varying p and q. Unlike the self-aligned case, each q curve is distinct and greater than the previous q curve. Even without prefetching (p = 0), the values are not identical because the increase in the line size to q reduces the chance that an unaligned target address will not be able to return n instructions.


[Figure omitted in this extraction: Expected Fetch vs. prefetch buffer size p for b = 1/8, n = 8, q = p + n, with curves for the align, extend, and simple cases.]

Figure 4.12: Different Cache Techniques with Prefetching

Figure 4.12 shows the expected instruction fetch for the simple cache, extended cache, and self-aligned cache with prefetching for b = 1/8, n = 8, q = p + n, and m = 2q (extended only), versus p. Similar to the cases without prefetching, the extended cache's fetching performance is between the simple and self-aligned cache techniques.


[Figure omitted in this extraction: Expected Fetch vs. prefetch buffer size p for b = 1/8, n = 16, q = p + n, with curves for the align, extend, and simple cases.]

Figure 4.13: Different Cache Techniques for Dual Block Fetching with Prefetching

Figure 4.13 shows the expected instruction fetch for the simple cache, extended cache, and self-aligned cache for dual block fetching with prefetching. The parameters are b = 1/8, n = 16, q = p + n, and m = 2q (extended only), versus p. The plot shows that a simple cache performs significantly worse than the self-aligned and extended caches.

The plots presented show that prefetching can significantly increase expected fetching. As the fetch width, q, increases, the expected fetch rate reaches a higher plateau. Unfortunately, with b = 1/8 and a decode width of eight, an extensive amount of hardware – a fetch width of sixteen, a prefetch buffer size of thirty-two, and a self-aligned cache – is required to reach almost 7 instructions fetched per cycle, still noticeably below the goal


of 8 instructions fetched per cycle. It is difficult to achieve a high fetching rate under those conditions because the decode width is the same size as the 1/b limit. On the other hand, if two blocks are fetched in a cycle with prefetching, a high rate close to 14 instructions fetched per cycle can be achieved.

4.4 Results and Discussion

This section compares the expected instruction fetch with the actual performance

of simulations from the SPEC95 benchmark suite running on the SPARC architecture.

Programs ran until completion or the first four billion instructions.

Table 4.2 shows the predicted and observed instruction fetch count results of these programs using the three cache techniques without prefetching (n = 4). Table 4.3 and Table 4.4 show the predicted and observed instruction fetch count results using the three cache techniques with prefetching (n = 4 and n = 8, respectively; for the n = 8 configuration, q = 16 and p = 32). The first column in these tables shows the observed value of 1/b, the average run length. The average dynamic run length of a program is the total number of instructions executed divided by the number of instructions that transferred control. The observed value of b for each program was used in the calculation of its expected fetch.

A concern with the fetching model presented is that the distribution of run lengths is

expected to be uniform, but in observing actual program behavior, the distribution is not

uniform. It does, however, generally follow the expected distribution. When the expected

fetch is calculated via a weighted sum, the outcome is reasonably accurate. As can be

seen in the tables, the difference between the predicted and observed fetch count is usually

within a few percent.


Table 4.2: Instructions Fetched per Cycle (n = 4)

Program   1/b    simple       extend       align
                 pred   obs   pred   obs   pred   obs

go 11.6 3.18 3.20 3.35 3.34 3.51 3.56

gcc 7.5 2.86 2.86 3.07 3.06 3.27 3.41

m88ksim 10.2 3.09 3.01 3.27 3.12 3.45 3.48

compress 9.93 3.07 3.31 3.25 3.43 3.44 3.59

li 6.9 2.78 2.75 2.99 3.10 3.21 3.31

ijpeg 21.5 3.51 3.51 3.62 3.59 3.73 3.73

perl 7.8 2.89 2.88 3.09 3.14 3.29 3.36

vortex 9.2 3.01 2.90 3.20 3.02 3.39 3.57

tomcatv 22.0 3.52 3.40 3.63 3.47 3.74 3.69

swim 114 3.90 3.86 3.92 3.93 3.95 3.96

su2cor 11.7 3.19 3.25 3.35 3.40 3.52 3.62

hydro2d 12.9 3.24 3.24 3.40 3.34 3.56 3.63

mgrid 79.6 3.85 3.68 3.89 3.81 3.93 3.85

applu 25.3 3.58 3.45 3.67 3.61 3.77 3.73

turb3d 14.6 3.32 3.37 3.46 3.46 3.61 3.69

apsi 54.9 3.79 3.76 3.84 3.81 3.89 3.87

fpppp 13.6 3.28 3.13 3.43 3.30 3.58 3.60

wave5 23.6 3.55 3.54 3.65 3.62 3.75 3.74


Table 4.3: Instructions Fetched per Cycle with Prefetching (n = 4)

Program   1/b    simple       extend       align
                 pred   obs   pred   obs   pred   obs

go 11.6 3.95 3.93 3.95 3.97 3.99 4.00

gcc 7.5 3.76 3.62 3.91 3.77 3.96 3.99

m88ksim 10.2 3.92 3.89 3.97 3.95 3.99 4.00

compress 9.93 3.91 3.92 3.97 3.98 3.99 3.99

li 6.9 3.69 3.71 3.87 3.87 3.94 3.98

ijpeg 21.5 3.99 3.96 4.00 3.96 4.00 4.00

perl 7.8 3.79 3.66 3.92 3.80 3.97 3.99

vortex 9.2 3.88 3.58 3.96 3.68 3.98 4.00

tomcatv 22.0 3.99 3.95 4.00 3.99 4.00 4.00

swim 114 4.00 4.00 4.00 4.00 4.00 4.00

su2cor 11.7 3.95 3.86 3.99 3.99 3.99 4.00

hydro2d 12.9 3.97 3.72 3.99 3.92 4.00 4.00

mgrid 79.6 4.00 4.00 4.00 4.00 4.00 4.00

applu 25.3 4.00 4.00 4.00 4.00 4.00 4.00

turb3d 14.6 3.98 3.69 3.99 3.77 4.00 3.99

apsi 54.9 4.00 4.00 4.00 4.00 4.00 4.00

fpppp 13.6 3.97 3.74 3.99 3.81 4.00 3.99

wave5 23.6 3.99 3.96 4.00 3.99 4.00 4.00


Table 4.4: Instructions Fetched per Cycle with Prefetching (n = 8)

Program   1/b    simple       extend       align
                 pred   obs   pred   obs   pred   obs

go 11.6 6.75 6.75 7.33 7.33 7.65 7.75

gcc 7.5 5.32 5.10 5.98 5.58 6.55 6.52

m88ksim 10.2 6.36 6.22 7.02 6.85 7.44 7.44

compress 9.93 6.27 7.19 6.94 7.45 7.38 7.64

li 6.9 5.03 4.89 5.66 5.86 6.22 6.49

ijpeg 21.5 7.79 7.14 7.93 7.51 7.97 7.89

perl 7.8 5.45 5.16 6.13 5.65 6.69 6.64

vortex 9.2 6.02 5.38 6.71 5.81 7.20 7.05

tomcatv 22.0 7.56 7.03 7.83 7.51 7.92 7.82

swim 114 8.00 7.95 8.00 7.98 8.00 7.99

su2cor 11.7 6.77 6.10 7.35 6.61 7.66 7.53

hydro2d 12.9 7.03 6.39 7.53 6.80 7.77 7.25

mgrid 79.6 8.00 7.97 8.00 7.99 8.00 8.00

applu 25.3 7.88 7.56 7.96 7.72 7.98 7.96

turb3d 14.6 7.30 5.87 7.69 6.29 7.85 6.92

apsi 54.9 7.99 7.93 8.00 7.98 8.00 8.00

fpppp 13.6 7.15 6.09 7.60 6.43 7.80 6.96

wave5 23.6 7.85 7.42 7.95 7.73 7.98 7.92


The expected and observed performance for dual block fetching without prefetching is exactly twice the values listed in Table 4.2 for n = 4. Table 4.5 lists the performance of SPEC95 for dual block fetching with prefetching (n = 8, q = 16, p = 8). The instructions fetched per cycle (IFPC) is listed as well as the instructions per fetch block (IPB). The results show that a close to ideal (n = 8) fetching rate is possible when a two-block fetching mechanism, such as the dual branch target buffer, is used with an extended or self-aligned cache and prefetching. In this case, the fetching hardware mechanism no longer restricts instruction fetching, opening the possibility of exploiting instruction-level parallelism and sustaining a high instructions-per-cycle execution rate.

Using a 256-entry, direct-mapped, two-tagged DBTB, the miss rate was between 10% and 20% for most of the SPEC95 benchmarks. Also, the miss rate for BTB2 was usually slightly higher than that for BTB1. BTB1 and BTB2 each behaved similarly to a standard BTB.

Although perfect branch accuracy was assumed in Table 4.5 (to make a fair comparison to

the other data), it is important to realize that accurate branch prediction becomes critical

since more branches need to be predicted accurately per fetch block. Therefore, the next

chapter presents a mechanism to predict two blocks per cycle with a greater accuracy than

the dual branch target buffer.

The overall performance will be much lower than the fetching rates shown when branch prediction, cache misses, execution, and other effects of a real microprocessor are simulated, and the differences between the values will be much smaller. These facts do not devalue the results, which show the upper limit achievable using the different fetching mechanisms presented, both in theory and in simulation.


Table 4.5: IPB and IFPC for Dual Block Fetching with Prefetching

Program   1/b    simple        extend        align
                 IPB   IFPC    IPB   IFPC    IPB   IFPC

go 11.6 9.90 7.79 11.2 7.90 12.3 7.98

gcc 7.5 8.01 7.18 9.3 7.49 10.5 7.93

m88ksim 10.2 9.24 7.68 11.0 7.87 11.7 7.98

compress 9.93 9.98 7.78 10.5 7.86 11.8 7.95

li 6.9 7.64 7.37 9.8 7.74 10.4 7.91

ijpeg 21.5 12.0 7.89 12.8 7.91 13.6 8.00

perl 7.8 8.37 7.36 9.9 7.67 10.7 7.93

vortex 9.2 8.07 7.14 10.6 7.52 11.8 7.99

tomcatv 22.0 11.7 7.88 12.4 7.97 13.9 8.00

swim 114 15.0 7.99 15.3 8.00 15.8 8.00

su2cor 11.7 9.5 7.53 11.2 7.78 12.4 7.99

hydro2d 12.9 9.8 7.36 11.5 7.90 12.7 8.00

mgrid 79.6 15.5 8.00 15.6 8.00 15.7 8.00

applu 25.3 12.4 7.97 13.2 8.00 13.9 8.00

turb3d 14.6 10.6 7.38 11.9 7.53 12.5 7.94

apsi 54.9 14.0 7.98 14.7 8.00 15.1 8.00

fpppp 13.6 13.0 7.79 13.5 7.92 14.9 7.99

wave5 23.6 12.4 7.89 13.2 7.96 14.0 7.99


Chapter 5

Multiple Branch and Block Prediction

Multiple branches and multiple blocks must be predicted in a single cycle to achieve a high instruction fetching rate. This chapter describes how to predict multiple branches in a single block and how to predict multiple blocks per cycle.*

A block of instructions may contain multiple basic blocks because some of the con-

ditional branches encountered may be predicted not taken. A prediction mechanism that

can only predict basic blocks limits potential performance improvement, since it may only

be able to predict one line to read from the instruction cache instead of multiple lines from

the instruction cache. Hence, Section 5.1 introduces how to predict multiple branches in a

single block. This allows a block of sequential instructions to be read up to the first control

transfer.

As Chapter 4 concluded, multiple blocks of sequential instructions need to be fetched

each cycle to overcome the limitation of single-block fetching. As a result, an accurate

prediction mechanism is required to predict multiple blocks. Therefore, Section 5.2 intro-

duces a novel mechanism to predict two blocks per cycle, using a select table. In addition,

*Parts of this chapter appear in the Third International Symposium on High-Performance Computer Architecture [47].


Section 5.4 explains how the select table can be expanded to predict multiple blocks per cycle. An important feature of this prediction mechanism is that multiple blocks are predicted in parallel.

The performance of these new prediction mechanisms is studied in Section 5.3. The

accuracy of predicting multiple branches per block is shown to be as good as predicting

each branch one at a time. The effective fetching performance of predicting one block,

two blocks, and multiple blocks per cycle is presented. Different types of misprediction

resulting from dual block prediction are described, and distributions of the contribution of

each type to the average branch execution penalty are given.

Finally, Section 5.5 presents cost estimates of multiple branch and block prediction

in terms of hardware storage and timing requirements.

5.1 Multiple Branch Prediction

The multiple global adaptive branch prediction by Yeh and Patt discussed in Chapter 2 retains the accuracy of their original single branch prediction. However, multiple reads from the PHT are not necessary for predicting multiple branches in a single block. Yeh's original two-level adaptive branch prediction can easily be scaled to perform multiple branch prediction for a single block. All of his schemes involve finding pattern history information to predict a single branch using a 2-bit up/down saturating counter (see Figure 2.4). A pattern history entry is expanded to contain information not for one branch instruction, but for an entire block of potential branch instructions. For example, if eight instructions per block are being fetched, a PHT entry will contain eight 2-bit counters, one for each position in a block.


One important difference is the updating of the global history register (GHR) or branch history register (BHR). Instead of being updated after the prediction of each individual branch, it is updated after the prediction for the entire block. Updating the GHR after each branch requires multiple reads from the PHT. To avoid this, the GHR is updated once per block, which may contain multiple branches. As a result, only a single entry needs to be read from the PHT. For example, if three branches are predicted not taken, not taken, and taken, then the GHR/BHR is shifted to the left three bits and "001" inserted. All of Yeh's original variations may be expanded in this manner, except that his per-addr variation now becomes a per-block variation.
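The per-block history update can be sketched in a few lines. This is my illustration; the 12-bit history length is an assumed parameter, not a value from the dissertation.

```python
GHR_BITS = 12  # assumed history length

def update_ghr(ghr: int, outcomes: list) -> int:
    """Shift in one bit per predicted branch, oldest first, once per block."""
    for taken in outcomes:
        ghr = ((ghr << 1) | int(taken)) & ((1 << GHR_BITS) - 1)
    return ghr

# The example from the text: not taken, not taken, taken inserts "001".
print(bin(update_ghr(0b0001, [False, False, True])))  # 0b1001
```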

The difference between Yeh's multiple global adaptive branch prediction and multiple branch prediction using a blocked PHT can be highlighted by considering an example in which every other instruction in a block of eight instructions is a conditional branch. Figure 5.1 shows how Yeh's multiple global adaptive branch prediction predicts these four branches. Starting with a GHR of "0001", it reads 15 entries – which is difficult to do – and selects four of those entries for prediction. The selection of the second branch is based on the prediction of the first branch; the selection of the third branch is based on the predictions of the first and second branches, and so on. As a result, the complexity of this multiplexer selection grows exponentially with the number of branch predictions. In contrast, the blocked PHT method in Figure 5.2 reads a block of eight sequential counters – which is easy to read – and selects the appropriate counters based on the least significant bits of each branch's address. The blocked PHT can predict all branches in the block up to the first taken branch.
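A blocked-PHT read can be sketched as follows. The row layout and selection logic are my illustration of the scheme described above, with assumed names and parameters.

```python
BLOCK = 8  # instructions per fetch block

def predict_block(pht_row, branch_positions):
    """Predict the block's branches in order, stopping at the first taken one.

    pht_row: BLOCK 2-bit saturating counters (0..3), one per position.
    branch_positions: positions of the conditional branches in the block.
    Returns (directions, exit position or None for fall-through).
    """
    preds = []
    for pos in sorted(branch_positions):
        taken = pht_row[pos] >= 2        # counter MSB gives the direction
        preds.append(taken)
        if taken:
            return preds, pos            # fetching ends at the first taken branch
    return preds, None                   # whole block falls through

# One row of eight counters; branches at positions 1, 3, and 5:
row = [0, 3, 1, 0, 2, 0, 1, 0]
print(predict_block(row, [1, 3, 5]))  # ([True], 1)
```

A single sequential row read thus replaces the exponential multiplexer tree: the counters are selected by position, and the first taken prediction determines the block's exit.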

Figure 5.3 is a block diagram of a multiple branch prediction fetching mechanism.

While the instruction cache is reading the current block of instructions, the instruction

fetcher at a minimum must predict the index of the next line to retrieve from the instruction


[Figure omitted in this extraction: starting with a GHR of "0001", Yeh's multiple global adaptive scheme reads 15 PHT entries and selects the four branch predictions (branch 1 through branch 4) through cascaded multiplexers.]

Figure 5.1: Multiple Global Adaptive Branch Prediction Example


[Figure omitted in this extraction: the blocked PHT reads one row of eight sequential 2-bit counters, and the counters for the branch positions within the block are selected by each branch's low-order address bits.]

Figure 5.2: Multiple Branch Prediction with Blocked PHT Example


cache. The complete address may be determined during subsequent cycles. Therefore,

an efficient method to predict target addresses is to use an NLS table. The NLS table is

modified and expanded to be indexed by the instruction block address and contain target

lines for an entire block of instructions. Alternatively, a Branch Target Buffer (BTB) may

be used [24]. The BTB, however, is also modified to be indexed and checked against the

instruction block address and contain target addresses for an entire block of instructions.

The NLS or BTB may be viewed as n separate tables accessed in parallel, which predict the

target address for each of the n possible branch exit positions. The actual target address, if

any, is selected at a later time. An NLS or BTB which predicts targets for a whole block is

called a target array.

In addition, the branch type information is no longer contained in the NLS table, but

in a separate block instruction type (BIT) table. In superscalar fetch prediction, knowing

what type of instructions are in a block is the most critical piece of information. Each

BIT entry contains two bits of information for each instruction in a cache line. This BIT

information may be pre-decoded and contained in the instruction cache line. For a faster

access time, it can be stored in a separate array. Instead of storing BIT information for each

instruction in the cache, a separate direct-mapped BIT table can be used with fewer entries

than the number of lines in the instruction cache. In this case, the predictor initially

assumes the BIT information it reads is correct. After the line has been read from the

instruction cache, the BIT information is verified and replaced, if necessary.
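This verify-and-replace flow can be sketched as follows; the table size, type codes, and class name are illustrative assumptions:

```python
# Sketch of a small direct-mapped BIT table used optimistically for
# prediction, then verified against the decoded cache line (illustrative).

class BitTable:
    def __init__(self, entries=256, n=8):
        self.entries = entries
        # one small type code per instruction in the line
        self.table = [[0] * n for _ in range(entries)]

    def read(self, line_addr):
        # prediction proceeds assuming this entry matches the fetched line
        return self.table[line_addr % self.entries]

    def verify(self, line_addr, decoded_types):
        """After the i-cache read, check the BIT entry; replace on mismatch.

        Returns True if the speculative BIT info was correct (no penalty),
        False if it was stale (one-cycle BIT penalty, entry replaced)."""
        idx = line_addr % self.entries
        if self.table[idx] == decoded_types:
            return True
        self.table[idx] = list(decoded_types)
        return False
```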

The BIT information for each instruction in a fetch block must contain at least two

bits, distinguishing a non-branch, a return instruction, a conditional branch, and other

types of branches. If this is expanded to three bits per instruction, it can encode additional

information about conditional branches with targets adjacent to the current line, referred

to as near-block targets. The offset into the line may be quickly


/�� ��

!� �"��

4�

�����

*8��"��/#/

#��)��

2����

���-�

2����

����(

,

�,�����

�� �� �5�

������

!� �"��

#�$��

/�"�(

� ��-���"

#���

������

*�.�

�����

� ��-���"

�����

8� �

67

*�.��� ��-���"

������8� �

Figure 5.3: Block Diagram of a Multiple Branch Prediction Fetching Mechanism


added with a log2(n)-bit adder as soon as the branch offset is ready. As a result, near-block

target addresses do not need to be stored in the target array, and the size of the target array

can be reduced.

Given the starting position in the fetched line and the BIT and PHT block information, the

instruction fetch control logic uses the instruction type information to find the first uncon-

ditional branch or conditional branch predicted to be taken based on its pattern history.

The next line to be fetched is then selected from a multiplexer whose input contains the

current line, previous line, following line, two lines after the current line, the top of the

return address stack (RAS), and the n possible targets from branches in a block. The BIT

codes and resulting prediction sources are summarized in Table 5.1. A schematic of the

logic required to select the branch position using the BIT and PHT information for a four

instruction block is shown in Figure 5.4 (except BIT type “011” replaces BIT type “111”

in Table 5.1 to simplify logic).

Table 5.1: Block Information Types and Prediction Sources

BIT code  Instruction Type           Prediction Source
0 0 0     Non-branch                 Fall-through PC
0 0 1     Return                     Return Stack
0 1 0     Other branches             Always use Target Array
0 1 1     Cond. branch, long target  Target Array entry or Fall-through, depending on PHT
1 0 0     Cond. branch, prev line    Current line - line size
1 0 1     Cond. branch, same line    Current line
1 1 0     Cond. branch, next line    Current line + line size
1 1 1     Cond. branch, next line+1  Current line + 2 * line size
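The selection logic above can be modeled behaviorally. This is a sketch, not the gate-level logic of Figure 5.4; the BIT codes follow Table 5.1 and the function names are invented for illustration:

```python
# Sketch of the next-line selection driven by the BIT codes of Table 5.1 and
# the blocked PHT (a behavioral model, not the actual priority-encoder gates).

NONBRANCH, RETURN, OTHER, COND_LONG = "000", "001", "010", "011"
COND_PREV, COND_SAME, COND_NEXT, COND_NEXT2 = "100", "101", "110", "111"

SOURCE = {RETURN: "RAS", OTHER: "target_array", COND_LONG: "target_array",
          COND_PREV: "line-1", COND_SAME: "line+0",
          COND_NEXT: "line+1", COND_NEXT2: "line+2"}

def select_exit(bit_codes, pht_taken, start):
    """Scan from the starting position for the first instruction that
    transfers control; return (exit_position, prediction_source)."""
    for pos in range(start, len(bit_codes)):
        code = bit_codes[pos]
        if code == NONBRANCH:
            continue
        if code in (RETURN, OTHER):          # unconditional: always redirect
            return pos, SOURCE[code]
        if pht_taken[pos]:                   # conditional, predicted taken
            return pos, SOURCE[code]
    return len(bit_codes) - 1, "fall-through"  # no taken branch in the block
```

Run against the block of Table 5.2, this model reproduces its exit positions and prediction sources.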


[Figure content: four instances of per-position selection logic, each combining BIT0, BIT1, BIT2, and PHT1 for block positions 0-3; their outputs and a fall-through signal feed a priority encoder producing the 2-bit branch position.]

Figure 5.4: Branch Selection Logic


The processor should keep track of the target address of each conditional branch that

is predicted not taken. In the event it is mispredicted, the correct block may be immedi-

ately fetched the following cycle after branch resolution. Otherwise, an additional cycle is

required to read the target address from the target array.

Table 5.2 is an example showing a line of instructions and the result of prediction.

The type of instruction, BIT information code, and PHT entry values are given. The starting

position corresponds to the beginning of a block. The exit position is where an instruction

transfers control. For each possible starting position, the exit position, next line select pre-

diction, target used for a misprediction, and the new prediction used after a misprediction

are shown. NLS(x) indicates that the target address for the exit position x is selected from

the NLS target array. For instance, if the starting position is 4, the exit position is 5 where a

conditional branch is predicted to be taken, and the NLS at position 5 is used for the target

address. If the branch is mispredicted, the return address stack is used as the target for the

next block. Since the pattern history indicates a “second chance” bit, the prediction will

not change the next time the branch is encountered.

Table 5.2: Next Line Prediction Example Based on Starting Position

Position in block        0       1       2       3       4       5       6       7
instruction type         shift   branch  add     jump    sub     branch  move    return
BIT value                000     100     000     010     000     011     000     001
PHT value                XX      10      XX      XX      XX      11      XX      XX
exit position            1       1       3       3       5       5       7       7
select prediction        line--  line--  NLS(3)  NLS(3)  NLS(5)  NLS(5)  RAS     RAS
target on misprediction  NLS(3)  NLS(3)  N/A     N/A     RAS     RAS     N/A     N/A
select replacement       NLS(3)  NLS(3)  N/A     N/A     NLS(5)  NLS(5)  N/A     N/A


5.2 Dual Block Prediction

Once an instruction which transfers control is encountered, no more instructions in a

block may be used. Another cycle is required to fetch from a different line in the instruction

cache. This is a barrier to fetching a large number of instructions in a single cycle. Hence,

what is needed is the capability to fetch multiple blocks in the same cycle. The problem is

determining which blocks to fetch each cycle.

Fetching two blocks per cycle requires predicting two lines per cycle. In order to

accomplish this prediction completely in parallel, only the address of the two lines currently

being fetched and any available branch history information may be used as a basis for

prediction. Using the PC from the last block currently being fetched, the first line can be

predicted using methods from the previous section. The difficulty arises in predicting the

following (second) line.

The underlying problem with predicting two lines to fetch is that the prediction for the

second line is dependent on the first. Hence, the PHT and BIT information for the second

line cannot be fetched until the first line has been predicted, and the new PC and GHR have

been determined. The solution to this problem is essentially to predict the prediction. The

end result of using the BIT and PHT for prediction is a multiplexer selector. Therefore,

because the BIT and PHT information for the second block prediction are not available, we

store the multiplexer selection bits of a previous prediction for that block into a select table

(ST). The select table is indexed by the exclusive-or of the GHR and the current PC block

address [25]. This index is the same as the index into the PHT for the prediction of the first

block. The select value read from the select table is used to directly control the multiplexer

for the second block prediction. A 3-bit selector can be used with a block width of four

(n = 4). Four bits are required for n = 8.
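A behavioral sketch of the select table follows, assuming illustrative sizes (1024 entries, 4-bit selectors for n = 8) and invented function names:

```python
# Sketch of the select-table (ST) lookup that "predicts the prediction" for
# the second block (sizes and names here are illustrative assumptions).

ST_ENTRIES = 1024
select_table = [0] * ST_ENTRIES   # one multiplexer selector per entry

def st_index(ghr, pc_block):
    # same gshare-style index used for the first block's PHT read
    return (ghr ^ pc_block) % ST_ENTRIES

def predict_second_block(ghr, pc_block):
    """Return the stored mux selector; it drives the second-block
    multiplexer directly, with no BIT/PHT computation in the path."""
    return select_table[st_index(ghr, pc_block)]

def update_on_misselect(ghr, pc_block, computed_selector):
    # once the real BIT/PHT-based selection is known, repair the entry
    select_table[st_index(ghr, pc_block)] = computed_selector
```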


5.2.1 Single Selection

Figure 5.5 is a block diagram of a dual block (two-block) prediction fetching mech-

anism. It has two multiplexers to select the next two lines to fetch. The first selection is

calculated from the PHT and BIT information. The second selection comes from the select

table. To accurately predict target addresses, a dual target array is used. It provides n target

addresses for the first block, and n target addresses for the second block. The address of the

second block currently being fetched is used as the index into both target arrays. Although

the NLS must have two target arrays, a BTB may use its tag to indicate the block number

(one or two).

Undesirable duplication of target addresses is inherent to the dual target array. A

branch’s target address could be stored in both target arrays. Also, it may be represented

in the second target array multiple times, since a branch may have multiple predecessor

blocks. This duplication, however, does not significantly reduce its accuracy compared to

a single target array.

The second multiplexer shown in Figure 5.5 is dependent on the output of the first

multiplexer. An addition to determine the fall-through address of the first prediction or

other near-block targets is required. Although the addition of a line index is relatively

small, if timing is critical, each of the n targets from the first target array and the RAS

can calculate the fall-through (and possibly near target(s)) indexes before the first block

selector is ready. The fall-through adder used as input for the second multiplexer can now

be replaced with a multiplexer which selects the correct pre-computed fall-through address

from the first target.


The RAS sends the top of its stack to the input of the first multiplexer. For the second

multiplexer, if the first block performs a call, the RAS input is bypassed with the address

after the exit address of the first block. If the first block performs a return, the RAS sends

the second address off the stack. Otherwise, the top of the stack is sent to the second

multiplexer. In addition, the target array should encode whether or not its target is a result

of a call, so that proper return bypassing can take place.
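The RAS routing for the second multiplexer can be sketched as a behavioral model; the encoding of the call/return kinds and the function name are assumptions:

```python
# Sketch of the RAS value routed to the second multiplexer, depending on
# whether the first predicted block performs a call or a return (behavioral).

def ras_for_second_block(ras, first_block_kind, first_block_return_addr):
    """ras is a list with the top of stack at index 0.

    first_block_kind is 'call', 'return', or 'other';
    first_block_return_addr is the address after the first block's exit."""
    if first_block_kind == "call":
        # the call has not pushed yet: bypass with its return address
        return first_block_return_addr
    if first_block_kind == "return":
        # the return will pop the top: the second block sees the next entry
        return ras[1]
    return ras[0]
```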

Figure 5.6 displays the pipeline stages involved in the dual block prediction. The first

stage is the prediction of the next two blocks (bX denotes block X). The selector for

the first predicted block is computed from BIT and PHT information. The second block

is predicted by reading the select table. The second stage fetches the two blocks. It also

verifies the select prediction in the previous stage against prediction computed using the

PHT and BIT information. If the prediction is different, then a misselect has occurred. The

previous prediction is replaced with the new prediction in the select table, and the new block

is fetched. Also during the second stage, the predicted target address of the first block is

checked against the calculated branch offset or immediate branch from the previous block

(misfetch). The third stage checks for a misfetch of the second block.

From the pipeline diagram, two problems are observed. One problem is with the

updating of the GHR. The GHR can reflect the outcome of the first block prediction, but

for the second block prediction, there is no information about the number of conditional

branches predicted or their outcome. Therefore, the select table entry needs to contain

prediction information to update the GHR. This can be accomplished by using log2(n) bits

to represent the number of not taken branches and one bit to represent either a fall-through

case or a taken branch. GHR prediction may be avoided by assuming a limited number of

conditional branches have been predicted “not taken.” Multiple entries in the PHT and ST



Figure 5.5: Block Diagram for Dual Block Prediction



Figure 5.6: Pipeline Stage Diagram for Dual Block Prediction


are read, and the correct entries are chosen once the number of branches in the previous

block has been determined.
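The GHR-update encoding described above can be sketched as follows; the field packing and function names are illustrative assumptions:

```python
# Sketch of the GHR-update information carried in a select-table entry:
# log2(n) bits count the not-taken branches and one bit records whether the
# block exited on a taken branch (field widths here are illustrative).

def pack_ghr_update(not_taken_count, ended_taken, n=8):
    count_bits = n.bit_length() - 1          # log2(n) bits for the count
    assert 0 <= not_taken_count < (1 << count_bits)
    return (not_taken_count << 1) | int(ended_taken)

def apply_ghr_update(ghr, packed, ghr_len=10):
    """Shift predicted outcomes into the GHR: one 0 per not-taken branch,
    then a final 1 if the block exited on a taken branch."""
    mask = (1 << ghr_len) - 1
    not_taken = packed >> 1
    for _ in range(not_taken):
        ghr = (ghr << 1) & mask              # shift in a 0 (not taken)
    if packed & 1:
        ghr = ((ghr << 1) | 1) & mask        # shift in a 1 (taken)
    return ghr
```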

Near-block select prediction for the second block causes another problem. It does

not give information about the offset into the line. As a result, up to log2(n) extra bits are

needed to provide this information, or there may be enough time to calculate the line offset

after its source block has been read. To avoid this problem, targets can always be predicted

from the target array instead of using near-block targets. The GHR and position prediction

(if any) are verified at the same time as the select prediction.

5.2.2 Double Selection

The selection prediction can be used on the first block as well as the second block.

Selection prediction of both blocks is referred to as double selection. Figure 5.7 is a block

diagram of two-block prediction using double selection. Double selection increases the

misselect penalty. However, the benefit is the removal of BIT storage altogether. The

instruction type is decoded after the line has been fetched. The select table is still indexed

by the exclusive-or of the GHR and starting address, but it is now a dual select table,

providing selectors for both multiplexers. Timing concerns regarding the calculation of the

selector bits for the first target no longer exist. The potential for timing problems from the

adders between the multiplexers is significantly reduced. Selector and GHR prediction bits

for both blocks are required, although the starting position prediction for the second block

is no longer needed.

Figure 5.8 is a pipeline diagram using double selection. The first stage predicts the

next two blocks from the dual select table. The second stage fetches the two blocks, and


verifies the first block’s select prediction and target address. The third stage verifies the

second block’s select prediction and target address.

5.2.3 Misprediction

The penalties for the different types of possible mispredictions are listed in Table 5.3.

It is assumed that it takes four cycles to resolve a branch after it has been fetched. For the

first block, if there are remaining instructions required to be re-fetched after a conditional

branch was mispredicted taken, then it will take an additional cycle. A misprediction on

the second block always requires another cycle. There is a one cycle misselect or GHR

mispredict penalty using a single selection for the second block.

With double selection, the first block has a one cycle penalty while the second block

takes two cycles for a misselection. Since a misselect is detected during or immediately

after the instructions have been fetched, instructions that would have been discarded on a

taken branch become valid, and no re-fetch cycle is needed. A misfetch takes one cycle for

the first block and two cycles for the second block to detect.

Since multiple blocks are being fetched using different cache lines, a multiple banked

instruction cache is required. With dual block fetching, two lines are fetched simultane-

ously, so they may map into the same cache bank. Should a conflict arise, the second line

is read the next cycle.
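The bank-conflict case can be sketched as follows, assuming the eight-bank, line-interleaved organization used later in Section 5.3:

```python
# Sketch of the bank-conflict check for dual block fetch with an eight-way
# banked i-cache (line size = block width; organization is an assumption).

N_BANKS = 8

def bank(line_addr):
    # consecutive lines map to consecutive banks (line-interleaved)
    return line_addr % N_BANKS

def fetch_two_lines(line_a, line_b):
    """Return the number of cycles needed: 1 normally, 2 on a bank
    conflict, when the second line must wait for the next cycle."""
    return 2 if bank(line_a) == bank(line_b) else 1
```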

In order to facilitate recovery from a mispredicted branch, each conditional branch is

assigned a bad branch recovery (BBR) entry, which provides information on how to update

branch prediction tables and provide a new target. The processor must create this entry and

keep track of it as the branch moves down the execution pipeline. For instance, a processor



Figure 5.7: Block Diagram for Dual Block Prediction Using Double Selection



Figure 5.8: Pipeline Stage Diagram for Dual Block Prediction Using Double Selection


Table 5.3: Misprediction Penalties

Misprediction           Single Select          Double Select
                        1st block  2nd block   1st block  2nd block
Conditional branch      4*         5           4*         5
Return                  4          5           4          5
Misfetch indirect       4          5           4          5
Misfetch immediate      1          2           1          2
Misselect               N/A        1           1          2
GHR                     N/A        1           1          2
BIT                     1          1           N/A        N/A
I-cache bank conflict   0          1           0          1

* Add one cycle if instructions remain and need to be re-fetched.

could use a table with a fixed number of BBR entries, and if there are more unresolved

branches than entries, the processor would stall. Alternatively, the processor could store

this information with each instruction in an instruction window. Some processors, such as

the SDSP, have unused storage in an instruction window entry for a branch instruction, and

this extra space is used to store recovery information [48].

Table 5.4 lists a description and sizes of the fields in a recovery entry. A recovery

entry is created after a conditional branch is predicted using BIT and PHT information.

When a prediction is made for a conditional branch, another prediction is made assuming

its original prediction is incorrect. If a branch is predicted not taken, then the alternate

target address is the branch’s target address. If it is predicted taken, then the alternate

address is the next control transfer or fall-through address in its block (see the example in


Table 5.2). The alternate target address is entered into the recovery entry. In addition, a

replacement selector and new GHR are generated.

Table 5.4: Bad Branch Recovery Entry

Bits Description

1 Block 1 or 2

1 Predicted taken or not taken

1 Second chance

8-12 PHT/ST index

2n PHT block (optional)

8-12 Corrected GHR

8-11 Replacement selector

10/30 Corrected i-cache index or full address

The PHT index is required from the recovery entry so that the counter of a pattern

history table entry can be correctly updated after a correct or incorrect branch prediction.

When a conditional branch is predicted, the counter of PHT is not immediately updated.

The 2-bit counter is stored in the BBR entry, as a predicted taken or not taken bit and a

second chance bit. When a branch instruction commits (i.e., all previous instructions and

branches have successfully completed), the PHT index from the BBR is used to update the

PHT to reflect a correct prediction. On the other hand, when a branch is discovered to be

mispredicted, the counter in the PHT is updated to reflect an incorrect prediction (see the

state transition diagram of Figure 2.4). Since a PHT entry contains counters for an entire

block of instructions, updating a single branch requires using a read/modify/write cycle.

In order to avoid a read/modify/write cycle, the original PHT block information may be


optionally stored in the BBR entry so that a PHT update requires only one write cycle to a

PHT entry.
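The single-write update can be sketched as follows, assuming a standard 2-bit saturating counter; the function names are illustrative:

```python
# Sketch of a PHT update that avoids a read/modify/write cycle by carrying
# the original block of counters in the BBR entry (a behavioral model).

def bump(counter, taken):
    """Standard 2-bit saturating counter: move toward taken or not-taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

def pht_writeback(bbr_pht_block, slot, taken):
    """Rebuild the whole blocked entry from the copy saved at prediction
    time, so commit or recovery needs only a single write to the PHT."""
    block = list(bbr_pht_block)
    block[slot] = bump(block[slot], taken)
    return block
```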

If the branch does not have a “second chance” when it is mispredicted, then the pre-

computed selector from the bad branch recovery entry is written into the select table.

If a misprediction occurs for the second block, then any remaining instructions from

the first block are fetched along with a new second block target retrieved from the recovery

entry. On the other hand, if the misprediction occurs for the first block, an extra cycle may

be required to fetch any remaining instructions from the previous block.

5.3 Performance

The objective of this section is to analyze the performance of the instruction fetch pre-

diction mechanisms presented thus far in this chapter. The performance was determined by

running the SPEC95 benchmark suite on the SPARC architecture. Each program ran for the

first one billion instructions. The performance for each suite, SPECint95 and SPECfp95, is

calculated by adding the results of each program in its respective suite.

To begin with, the conditional branch accuracy of predicting multiple branches in

a single block via a blocked PHT is compared against a non-blocked PHT of equal size.

Next, the impact of incorrect BIT information on performance is presented. Section 5.3.3

examines the performance of dual block prediction by comparing the performance of single

selection and double selection with different sized PHTs and STs. There are many options

for implementing a target array: BTB or NLS, number of entries, and near-block target pre-

diction. The performance and relationship of these options are studied. In Section 5.3.5, the

performance of single block fetching and two-block fetching is compared using different


types of instruction caches. In addition, a breakdown of the different misprediction penal-

ties is shown for two-block fetching for both single selection and double selection. Finally,

performance using two-block prefetching is compared using different decode widths.

All the results presented use a block width of eight (n = 8). Single selection is used

for dual block prediction unless otherwise noted. The results presented only use a global

adaptive branch prediction scheme using one global blocked pattern history table. The

default size of a select table is 1024 entries, which corresponds to a GHR length of 10 bits.

The size of the RAS is 32 entries. It was assumed the processor would always have bad

branch recovery entries available.

Instruction cache misses were not simulated, i.e., a perfect instruction cache was

assumed. All the results presented would have their performance lowered if instruction

cache misses are included. The objective of this section is to examine the performance of

instruction fetch and prediction mechanisms only. Therefore, cache misses and execution

stalls are not considered. The only considerations for the instruction cache were the line size

and bank conflicts. A line size equal to the block width was used, and the instruction cache

was split into eight banks.

The default target array is a 256-entry NLS array. The set prediction was not simu-

lated. Therefore, the results presented for the NLS configuration are really a direct-mapped

tag-less BTB. The performance of a real NLS is affected by the associativity of an instruc-

tion cache, since it may incorrectly predict the set of the cache. Direct-mapped caches,

though, do not need set prediction. For a performance and cost comparison of an NLS

versus a BTB, please refer to [7]. Also, by default, near-block target prediction is not used.


The branch execution penalty (BEP) gives information regarding performance and

the interaction between many different types of penalties, as listed in Table 5.3. Never-

theless, all types of penalties are recorded, so that the contribution of each penalty type

towards the overall BEP can be shown. If multiple penalties overlap during fetching, only

the most significant penalty is recorded. For example, if a conditional branch is mispre-

dicted, it is irrelevant if there is a misselect on subsequent blocks. All of those instructions

will be invalidated once the branch is resolved. Overall performance is best understood

from the effective instruction fetch rate. One cannot directly compare a scalar BEP with

a superscalar BEP or a multi-block BEP since higher penalties are overcome by an increased

number of instructions per successful fetch block.

Also, when fetching two blocks per cycle of potentially eight instructions each, up to

sixteen instructions may be returned in one cycle. Consequently, the effective instruction

fetching rate, IPC_f, can be greater than n. If an eight-issue processor is used, then extra

instructions returned can be buffered. This would correspond to a two-block prefetching

scheme with n = 8 and q = 16 as presented in Chapter 4. When the raw two-block rate

is greater than n, the issue unit will usually receive, and average close to, n instructions

per request. Of course, a simpler configuration to satisfy issue unit constraints in such a

situation would be to use two blocks of four instructions each. This would still yield an

excellent fetching rate. By default, prefetching is not used with two-block fetching.

5.3.1 Conditional Branch Accuracy

To begin with, the conditional branch accuracy of a blocked PHT for multiple branch

prediction was evaluated. The branch history length varied from 6 to 12. The results were

compared to a per-addr scalar PHT (GAp) with 8 PHTs (see Figure 2.5) to give it equal


size to a blocked PHT for n = 8. Figure 5.9 displays the branch misprediction rates and

the improvement over a scalar PHT. The difference in accuracy between the scalar and

blocked schemes across all variations was small, and the accuracy favored the blocked

PHT scheme for most programs. The accuracy of SPECint95 averaged 91.5% while the

accuracy of the SPECfp95 averaged 97.3%, using a GHR length of 10. In this case, the

blocked PHT had a better accuracy by a few hundredths of a percent for SPECfp95 and a

few tenths of a percent for SPECint95. The results also show that the accuracy of a blocked

PHT is more sensitive to small GHR lengths, where its accuracy may not be as

good as that of a scalar PHT.

[Figure content: misprediction rate (%) and misprediction improvement over scalar (%) versus branch history length (6, 8, 10, 12); curves: Int improvement, FP improvement, Int miss rate, FP miss rate.]

Figure 5.9: Branch Misprediction Rate and Improvement


5.3.2 Block Information Type

Correct instruction type information for a block is critical to making accurate predic-

tions. Incorrect BIT information can still result in a correct prediction, but this possibility

is reduced with larger block sizes. Different BIT table sizes were simulated to evaluate

their impact. Using single block fetching, Figure 5.10 shows the BEP contribution from

inaccurate BIT information. Also shown is the IPC_f. Small BIT tables result in

poor performance; not until about 2048 entries does the BIT contribution to the BEP drop below

5%. Therefore, for smaller instruction caches, it may be more beneficial to store the

BIT information inside the instruction cache. Conversely, a separate BIT table would be

more cost effective because the one cycle miss penalty of the BIT is much lower than an

instruction cache miss.

The results demonstrate that it is important to use a BIT table sufficiently large to make the impact of inaccurate BIT information small, or to guarantee accurate BIT information. The rest of the results presented use two blocks and assume either that BIT information is stored in the instruction cache, or that a separate table has as many BIT entries as there are blocks in the instruction cache, preventing the use of incorrect BIT information.

5.3.3 Single vs. Double Selection

The performance of the select table depends on the branch history length and the

number of select tables used. Multiple select tables are indexed by the starting position

of the current address (the least significant bits). The correct target depends on the entering position in a block, so multiple select tables help identify which target should be

selected. The least significant bits of the starting address determine which select table is


[Plot: x-axis BIT block entries (64–4096); axes: BEP and IPC_f; series: Int and FP BEP and IPC_f.]

Figure 5.10: Block Information Type Penalty and Performance


used. Figure 5.11 shows the performance of dual block prediction for single selection (SS)

and double selection (DS). The global history register length varies from 9 to 12. There

can be 1, 2, 4, or 8 STs. However, only a single PHT is used. The results demonstrate

that increasing the number of STs improves performance as well as increasing the branch

history length. The extra penalties from using double selection significantly reduced performance, by roughly 10% in most cases; hence, single selection is preferred. Double selection improves significantly, though, with more STs.
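The table-selection step can be sketched as follows (a minimal illustration; the masking scheme and names are assumptions):

```python
# Sketch: low-order bits of the block's starting address pick one of the
# select tables; the GHR indexes within it. All names are illustrative.
NUM_ST = 8       # number of select tables (power of two)
GHR_BITS = 10    # global history register length

def select_table_lookup(tables, start_addr, ghr):
    which = start_addr & (NUM_ST - 1)              # entering-position bits
    return tables[which][ghr & ((1 << GHR_BITS) - 1)]

tables = [[0] * (1 << GHR_BITS) for _ in range(NUM_ST)]
tables[3][0x155] = 2          # pretend this select entry has been trained
selected = select_table_lookup(tables, 0xAB, 0x155)   # 0xAB & 7 == 3
```

With more select tables, blocks entered at different positions train different entries, which is why accuracy improves as the number of STs grows.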

[Plot: x-axis Branch History Length / # Select Tables (9/1 through 12/8); y-axis IPC_f (3–10); series: Int/SS, Int/DS, FP/SS, FP/DS.]

Figure 5.11: Single and Double Selection Performance


5.3.4 Target Arrays

Target arrays can use a BTB or NLS. In addition, if a near-block target is used, this

will reduce the number of immediate targets used in the target array. Table 5.5 shows the

percentage of BEP due to indirect and immediate misfetches for SPECint95. The total BEP

and IPC_f are also reported. The number of block entries is varied for both NLS and a 4-way BTB using an LRU replacement algorithm. A BTB entry can be for the first or second target, while an NLS entry has two separate targets. The data indicate that eight NLS block entries are needed to match the performance of one 4-way BTB entry, because the BTB is 4-way associative while the NLS is direct-mapped. About 70% of the conditional branches are near-block targets. As a result of using near-block encoding, the number of BTB or NLS entries can be halved for about the same performance.

5.3.5 Instruction Cache Configurations

The performance can be dramatically improved if a different type of instruction cache

configuration is used, as described in Chapter 4. To increase the number of instructions per

block (IPB), the cache line size can be extended to 16 instructions. For the highest possible

IPB, a self-aligned cache should be used. If a self-aligned cache is used, though, the number

of banks should be doubled to offset the increase in bank conflicts, since up to four lines are

being simultaneously accessed to return two blocks. Although there are no bank conflicts

with single block fetching, the extended and self-aligned caches improve the instructions

fetched per block and overall fetching performance.


Table 5.5: Indirect and Immediate Misfetch Penalty Comparison for Different Target Array Configurations

Target  # block  near-    %BEP misfetch       BEP    IPC_f
type    entries  block?   imm.   indirect
BTB        8     no       19.2   18.7         0.603  5.02
BTB        8     yes      10.6   16.3         0.520  5.40
BTB       16     no       12.6   15.1         0.523  5.32
BTB       16     yes       6.5   12.6         0.476  5.57
BTB       32     no        7.4   11.6         0.473  5.58
BTB       32     yes       3.6    9.6         0.446  5.73
BTB       64     no        4.0    9.6         0.447  5.72
BTB       64     yes       1.9    7.9         0.431  5.80
NLS       64     no       12.0   14.7         0.516  5.41
NLS       64     yes       6.7   13.1         0.480  5.54
NLS      128     no        8.3   12.3         0.481  5.53
NLS      128     yes       4.2   10.8         0.454  5.67
NLS      256     no        5.5   10.1         0.457  5.66
NLS      256     yes       2.7    8.7         0.438  5.77
NLS      512     no        3.8    9.2         0.444  5.74
NLS      512     yes       1.6    7.9         0.429  5.81


With the extended and self-aligned caches, when branch prediction is performed,

the values wrap around the PHT block. For instance, if the starting position of an eight-

instruction wide block is at address 7, the first instruction will use the last (eighth) counter

in a PHT block, and the second instruction will wrap around the PHT block and use the

first counter in that PHT block. Also, the target arrays must be correspondingly extended or

self-aligned. The performance of these three cache types is compared using one and two

block fetching with single selection. The results are shown in Table 5.6, using 8 STs and a

branch history length of 10. Notably, the self-aligned cache achieves 10.9 IPC_f for SPECfp95, and it averages over 8 IPC_f for the entire SPEC95 suite. The high performance is primarily due to the increase in IPB. Also, the starting address becomes more random, which helps associate a select table with each entering position and use it efficiently. The performance of the extended cache

type is between a normal and self-aligned cache. Compared to single block prediction,

dual block prediction has an effective fetching rate approximately 40% higher for integer

programs and 70% higher for floating point programs.
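The wrap-around indexing can be sketched as follows (the function name is hypothetical):

```python
BLOCK_WIDTH = 8  # instructions per PHT block, n

def pht_counter_indices(start_slot, n=BLOCK_WIDTH):
    # The i-th fetched instruction uses counter (start_slot + i) mod n,
    # wrapping around the PHT block as described in the text.
    return [(start_slot + i) % n for i in range(n)]

# Starting at slot 7: the first instruction uses counter 7, the second
# wraps around to counter 0, and so on.
indices = pht_counter_indices(7)
```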

Table 5.6: IPB and IPC_f for Different Cache Types

cache          line         ------- SPECint95 -------   -------- SPECfp95 -------
type           size  banks  IPB   IPC_f     IPC_f       IPB   IPC_f     IPC_f
                                  1 block   2 block           1 block   2 block
normal           8     8    5.01  3.96      5.66        5.81  5.48       9.43
extended        16     8    5.30  4.12      5.87        6.03  5.65       9.80
self-aligned     8    16    5.99  4.53      6.42        6.76  6.33      10.88

Using a self-aligned cache, 8 STs, and a branch history length of 10, Figure 5.12

shows the BEP of each program and the contribution of BEP by each type of misprediction

as described in Section 5.2.3. Also, Table 5.7 shows the BEP distribution for each block.


These are for single selection, while Figure 5.13 and Table 5.8 show the distribution for

double selection. The effective instruction fetching rate is proportional to the number of

instructions per block (IPB) and inversely proportional to the product of the average branch

execution penalty and the total number of branches executed. As a result, a program with a lower BEP may have a smaller IPC_f because it executes more branches.
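This relationship can be illustrated with a simple hedged model (an approximation of the stated proportionality, not the simulator's exact accounting):

```python
# Assumed model: fetch cycles = base cycles to fetch all blocks plus BEP
# penalty cycles per executed branch; IPC_f = instructions / cycles.
def effective_fetch_rate(ipb, bep, branches_per_instr, blocks_per_cycle=2):
    base = ipb * blocks_per_cycle   # instructions per penalty-free cycle
    return base / (1.0 + base * bep * branches_per_instr)

# A branch-dense program (second call) can trail despite its lower BEP.
sparse = effective_fetch_rate(ipb=6.0, bep=0.40, branches_per_instr=0.10)
dense = effective_fetch_rate(ipb=6.0, bep=0.30, branches_per_instr=0.20)
```

Under these assumed numbers the branch-dense program ends up with the lower effective fetch rate even though its BEP is lower, matching the observation in the text.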

The BEP distribution in those figures shows that the most significant BEP contribution is from misprediction of conditional branches. Misselection is the next most significant

contribution. Target array mispredictions are also a significant factor in BEP. Some of the

floating point programs performed exceedingly well. On the other hand, some integer programs had a high BEP because of poor conditional branch prediction.

5.3.6 Prefetching

A prefetch buffer, as described in Chapter 4, can be used in conjunction with two-block prediction and single selection. Table 5.9 shows the performance for different decode (issue) sizes from 4 to 16. A global history register length of 12, one select table, and a 512-entry NLS target array were used. The instructions per fetch request (IPFQ) and effective

instruction fetching performance are shown. The IPFQ is the average number of instructions returned to the decoder, including penalties from misselection, misfetching, and bank

conflicts, but not penalties from branch prediction, indirect branches, and returns. This is

to demonstrate how well the instruction fetch mechanism, including instruction fetch prediction, can deliver instructions to the decoder. Instructions fetched from the incorrect path

are the result of incorrect branch prediction; its accuracy is equivalent to scalar prediction.

Table 5.9 shows that the IPFQ is relatively close to the decode size up to about 14. These


[Bar chart: BEP per program (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, CFP95, CINT95, gcc, compress, go, ijpeg, li, m88ksim, perl, vortex), stacked by bank conflict, return, misfetch indirect, misfetch immediate, ghr, misselect, and mispredict contributions.]

Figure 5.12: Branch Execution Penalties for Dual Block, Single Selection


Table 5.7: BEP Distribution, IPB, and IPC_f for Dual Block, Single Selection

Program    ------ Block 1 ------  -------------- Block 2 --------------   BEP    IPB   IPC_f
           cnd   ind   imm   ret  cnd   sel   ghr   ind   imm   ret   bnk
applu      .054  .000  .000  .000 .068  .010  .008  .000  .000  .000  .009  .149  7.28  12.87
apsi       .041  .000  .001  .006 .046  .034  .022  .001  .004  .007  .020  .183  7.68  14.10
fpppp      .101  .001  .001  .000 .121  .059  .032  .009  .000  .000  .016  .340  7.71  14.19
hydro2d    .005  .012  .006  .000 .007  .020  .002  .027  .009  .000  .003  .091  6.34  11.17
mgrid      .103  .000  .000  .000 .121  .075  .007  .001  .001  .000  .009  .318  7.86  14.85
su2cor     .022  .007  .032  .000 .026  .047  .009  .021  .060  .000  .015  .240  5.76   7.46
swim       .029  .000  .000  .000 .032  .025  .002  .007  .000  .000  .015  .110  7.61  14.65
tomcatv    .033  .002  .017  .000 .041  .029  .004  .016  .032  .000  .012  .185  5.92   8.37
turb3d     .060  .003  .003  .000 .071  .069  .006  .011  .016  .000  .034  .272  6.21   9.56
wave5      .066  .002  .000  .007 .079  .067  .005  .017  .050  .009  .036  .337  6.46   9.31
CFP95      .037  .004  .013  .001 .044  .040  .007  .016  .031  .001  .016  .211  6.76  10.88
gcc        .173  .020  .044  .003 .205  .066  .013  .056  .061  .004  .008  .653  5.61   4.40
compress   .235  .000  .000  .000 .289  .068  .032  .000  .000  .000  .006  .631  6.43   5.43
go         .348  .018  .016  .001 .409  .132  .034  .053  .030  .001  .011 1.052  6.43   4.40
ijpeg      .152  .001  .002  .000 .185  .042  .007  .005  .002  .000  .001  .397  7.03   9.44
li         .056  .002  .019  .003 .070  .021  .006  .014  .025  .004  .012  .232  5.35   6.88
m88ksim    .055  .005  .011  .000 .066  .020  .005  .009  .009  .000  .009  .187  5.82   8.60
perl       .063  .011  .048  .004 .076  .055  .006  .028  .070  .005  .029  .395  5.59   6.08
vortex     .037  .010  .009  .004 .044  .051  .011  .030  .017  .005  .014  .232  5.80   7.77
CINT95     .123  .007  .018  .002 .149  .053  .014  .022  .026  .003  .013  .429  5.99   6.42


[Bar chart: the same per-program stacked BEP breakdown (bank conflict, return, misfetch indirect, misfetch immediate, ghr, misselect, mispredict) for double selection.]

Figure 5.13: Branch Execution Penalties for Dual Block, Double Selection


Table 5.8: BEP Distribution, IPB, and IPC_f for Dual Block, Double Selection

Program    ------------- Block 1 -------------  -------------- Block 2 --------------   BEP    IPB   IPC_f
           cnd   sel   ghr   ind   imm   ret    cnd   sel   ghr   ind   imm   ret   bnk
applu      .054  .012  .001  .000  .000  .000   .068  .019  .016  .000  .000  .000  .009  0.180  7.28  12.57
apsi       .041  .039  .017  .000  .001  .006   .046  .068  .044  .001  .004  .007  .020  0.295  7.68  13.42
fpppp      .101  .048  .033  .001  .001  .000   .121  .117  .065  .009  .000  .000  .016  0.512  7.71  13.63
hydro2d    .005  .017  .001  .012  .006  .000   .007  .039  .003  .027  .009  .000  .003  0.131  6.34  10.62
mgrid      .103  .056  .016  .000  .000  .000   .121  .149  .015  .001  .001  .000  .009  0.472  7.86  14.46
su2cor     .022  .035  .005  .007  .032  .000   .026  .093  .019  .021  .060  .000  .015  0.336  5.76   6.54
swim       .029  .005  .000  .000  .000  .000   .032  .050  .004  .007  .000  .000  .015  0.142  7.61  14.50
tomcatv    .033  .018  .001  .002  .017  .000   .041  .057  .008  .016  .032  .000  .012  0.238  5.92   7.72
turb3d     .060  .037  .010  .003  .003  .000   .071  .137  .011  .011  .016  .000  .034  0.393  6.21   8.67
wave5      .066  .046  .005  .002  .000  .007   .079  .135  .009  .017  .050  .009  .036  0.460  6.46   8.45
CFP95      .037  .028  .005  .004  .013  .001   .044  .080  .013  .016  .031  .001  .016  0.291  6.76  10.13
gcc        .173  .046  .005  .020  .044  .003   .205  .133  .027  .056  .061  .004  .008  0.785  5.61   3.92
compress   .235  .035  .013  .000  .000  .000   .289  .137  .064  .000  .000  .000  .006  0.778  6.43   4.78
go         .348  .126  .024  .018  .016  .001   .409  .265  .068  .053  .030  .001  .011  1.369  6.43   3.68
ijpeg      .152  .015  .003  .001  .002  .000   .185  .083  .014  .005  .002  .000  .001  0.463  7.03   8.95
li         .056  .006  .000  .002  .019  .003   .070  .042  .012  .014  .025  .004  .012  0.265  5.35   6.55
m88ksim    .055  .008  .000  .005  .011  .000   .066  .039  .010  .009  .009  .000  .009  0.220  5.82   8.23
perl       .063  .024  .002  .011  .048  .004   .076  .111  .011  .028  .070  .005  .029  0.482  5.59   5.53
vortex     .037  .033  .010  .010  .009  .004   .044  .102  .022  .030  .017  .005  .014  0.337  5.80   6.76
CINT95     .123  .033  .007  .007  .018  .002   .149  .107  .027  .022  .026  .003  .013  0.536  5.99   5.76


results indicate that the instruction fetch mechanism and fetch prediction can sustain an adequate instruction fetching rate, but branch mispredictions restrict the effective instruction

fetching performance.

Table 5.9: Two-block Prediction with Prefetching for Different Decode Sizes

                          Decode Size
Suite  Metric    4     6     8     10    12    14    16
Int    IPFQ     3.90  5.73  7.75  8.86  9.88  10.5  10.7
       IPC_f    3.20  4.38  5.71  6.39  6.91  7.22  7.35
FP     IPFQ     3.98  5.96  7.91  9.69  11.2  12.5  13.2
       IPC_f    3.84  5.64  7.39  8.91  10.3  11.2  11.8

5.4 Multiple Block Prediction

In addition to two-block prediction, multiple blocks can be predicted by a simple extension of the two-block scheme. Given the current block address, the first block of

the next fetch cycle is predicted using BIT and PHT information (single selection) or using

a select table entry (double selection). The second and remaining blocks of the next fetch

are predicted using the select table. Instead of the select table entry providing select bits for

the first and/or second block, it provides select bits for all blocks. The relationship between

the current fetch cycle and predicting multiple blocks for the next fetch cycle is illustrated

in Figure 5.14.
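Packing select bits for several blocks into one entry can be sketched as follows (the per-block field width of log2(n) + 1 bits and the packing order are assumptions):

```python
import math

# Sketch: one select-table entry packs a (log2 n + 1)-bit select field
# per predicted block (a target slot plus a valid/taken bit). The field
# width and layout are illustrative assumptions.
BLOCK_WIDTH = 8
BITS_PER_BLOCK = int(math.log2(BLOCK_WIDTH)) + 1   # 4 bits per block

def unpack_selects(entry, num_blocks):
    mask = (1 << BITS_PER_BLOCK) - 1
    return [(entry >> (i * BITS_PER_BLOCK)) & mask
            for i in range(num_blocks)]

fields = unpack_selects(0x3A1, 3)   # lowest block's field comes first
```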

When the second and remaining blocks are predicted from the select table, they

are all verified at the same time. This is done during the next cycle for single selection

(see Figure 5.6, Verify b2 select stage) or the cycle after that for double selection (see


[Diagram: the select-table entry read during the current fetch cycle provides the select bits and targets used to predict all blocks of the next fetch cycle.]

Figure 5.14: Predicting Multiple Blocks

Figure 5.8). In addition, the target array is expanded to provide targets for multiple blocks.

Also, another read/write port to the PHT and BIT tables is needed for each additional block

predicted.

The effective instruction fetching performance from a single block to four blocks

is shown in Figure 5.15. A global history register length of 12, one select table, and a

512-entry NLS target array were used. The floating point benchmarks showed remarkable

increases in fetching performance, achieving 16 IPC_f with four-block fetching. On the

other hand, the improvement from the integer benchmarks was not as impressive after

two-block fetching, because poorer branch prediction accuracy inhibits its performance

potential. In addition, as more blocks are predicted per cycle, the accuracy of the selection

table decreases, eventually leading to negligible improvement.


[Plot: x-axis Blocks per cycle (1–4); y-axis IPC_f (0–16); series: Int, FP.]

Figure 5.15: Effective Instruction Fetch for Different Block Prediction Capability


5.5 Cost Estimates

This section presents cost estimates of multiple branch and block prediction in terms

of hardware storage and timing requirements. Using simplified hardware cost estimates, the

amount of hardware storage is evaluated for single block fetching and dual block fetching

with single and double selection. Also, using a timing model for a 0.5 µm CMOS technology, timing estimates are given for each structure used in dual block prediction. Timing charts show how the structures relate to the critical path. Single selection and double selection are compared based on hardware and timing requirements.

5.5.1 Storage

Simplified hardware cost estimates were developed to get an idea of the storage requirements for the different pieces of multiple branch and block prediction. Given the

block width, history register length, number of PHTs, number of select tables, size of NLS,

and type of instruction cache, the total number of bits required can be estimated. Table 5.10

lists the parameters and equations which give simplified hardware cost estimates for the

PHT, ST, NLS, BIT, and BBR tables. Single block fetching requires the use of a PHT, NLS

target array, BIT table, and a BBR. Dual block prediction with single selection requires the

use of a PHT, ST, two NLS target arrays, BIT table, and a BBR. Dual block prediction with

double selection requires the use of a PHT, two STs, two NLS target arrays, and a BBR.

Multiple block prediction in excess of two blocks requires a corresponding number of STs

and NLS target arrays.

Here is an example of a hardware cost estimate using specific values for the parame-

ters in Table 5.10. Using a block width of 8, a 32 Kbyte direct-mapped instruction cache, a


Table 5.10: Simplified Hardware Cost Estimates

Symbol  Description
n       block width
k       history register length
p       number of PHTs
s       number of Select Tables
t       number of NLS block entries
l       size of line index
a       cache associativity
b       number of BBR entries
i       number of BIT block entries

Table   Simple hardware cost estimate (bits)
PHT     p × 2^k × 2n
ST      s × 2^k × 2(log2 n + 1)
NLS     t × n × (l + log2 a + log2 n)
BIT     i × 2n
BBR     b × (2k + l + 2 log2 a + 2(log2 n + 1))


10-bit GHR, 1 PHT, 1 ST, 256 NLS entries, 1024 BIT entries, and 8 BBR entries, the cost estimates evaluate to:

- PHT: 16 Kbits
- ST: 8 Kbits
- NLS: 26 Kbits
- BIT: 16 Kbits
- BBR: .3 Kbits
- single block total: 58 Kbits
- dual block, single select total: 92 Kbits
- dual block, double select total: 84 Kbits
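The example can be reproduced mechanically; the sketch below assumes the per-table formulas shown in its comments, chosen so that they match the worked numbers above:

```python
import math

# Sketch reproducing the worked example. The per-table formulas are
# assumptions chosen to match the example's numbers (16, 8, 26, 16,
# and .3 Kbits), not a verbatim copy of Table 5.10.
def cost_bits(n, k, p, s, t, l, a, b, i):
    lg = lambda x: int(math.log2(x)) if x > 1 else 0
    pht = p * (2 ** k) * 2 * n                  # n two-bit counters/entry
    st = s * (2 ** k) * 2 * (lg(n) + 1)         # select bits for two blocks
    nls = t * n * (l + lg(a) + lg(n))           # line index + set + slot
    bit = i * n * 2                             # two type bits/instruction
    bbr = b * (2 * k + l + 2 * lg(a) + 2 * (lg(n) + 1))
    return pht, st, nls, bit, bbr

# Block width 8, 32 KB direct-mapped cache (line index l = 10), 10-bit
# GHR, 1 PHT, 1 ST, 256 NLS entries, 1024 BIT entries, 8 BBR entries.
pht, st, nls, bit, bbr = cost_bits(n=8, k=10, p=1, s=1, t=256,
                                   l=10, a=1, b=8, i=1024)
single = pht + nls + bit + bbr            # single block:        ~58 Kbits
dual_ss = pht + st + 2 * nls + bit + bbr  # dual, single select: ~92 Kbits
dual_ds = pht + 2 * st + 2 * nls + bbr    # dual, double select: ~84 Kbits
```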

When single selection is used, a BIT table with as many entries as there are rows in the instruction cache should be used to avoid any BIT mispredictions. As a result, the cost of predicting the next block based on PHT and BIT information, as with single selection, increases as the instruction cache grows. A total cost comparison for different instruction cache sizes using single block prediction, dual block prediction with single selection, and dual block prediction with double selection is shown in Figure 5.16. A block width of 8, a 10-bit GHR, 1 PHT, 1 ST, 256 NLS entries, and 8 BBR entries were used in evaluating these estimates. Comparing double and single selection, beyond a cache size of 16 Kbytes (512 BIT entries) double selection requires less bit storage than single selection. Eventually, the cost of BIT storage makes single-block fetch prediction more expensive than dual block prediction with double selection.


[Plot: x-axis cache size (2k–128k bytes); y-axis bits (0–160,000); series: 1 block, 2 block SS, 2 block DS.]

Figure 5.16: Hardware Storage Cost of Prediction for Different Cache Sizes


The cost difference between single and double selection is not solely dependent on

BIT storage. The most significant factor is the size of the select tables. Since double selection requires a select table twice as large as single selection's, the cost of double selection can easily become much larger than that of single selection. The total hardware storage cost is shown

in Figure 5.17, where the length of the GHR and the number of select tables are varied for

dual block prediction with single selection and double selection. A block width of 8, a 32

Kbyte direct-mapped instruction cache, 1 PHT, 256 NLS entries, 1024 BIT entries, and 8

BBR entries were used in evaluating these estimates. The graph demonstrates that the cost

of double selection is less than the cost of single selection for small history register lengths

and few select tables. After a history register length of 10, the cost of double selection is

always greater than single selection. For both single and double selection, though, the cost

of using large select tables becomes excessive. Implementing a prediction mechanism with

a cost greater than 200 Kbits is not reasonable for today’s technology.

In summary, double selection can provide cost savings over single selection with a

small select table or with a large instruction cache. The cost savings, however, are not

extremely significant, and in most cases the cost of double selection is greater. Given these

cost estimates and the performance results of Section 5.3.3, the performance loss from

double selection does not justify a small storage savings. As the next section will show,

though, double selection may prove invaluable in reducing the cycle time.

5.5.2 Timing

Double selection and single selection have different timing requirements. In order to

evaluate these requirements, the timing model of Wilson and Jouppi [51] is used to make

access and cycle time estimates of the different tables, caches, and logic required. The


[Plot: x-axis Branch History Length / # Select Tables (9/1 through 12/8); y-axis bits (0–700,000); series: Single Select, Double Select.]

Figure 5.17: Hardware Storage Cost of Dual Block Prediction for Single and Double Selection


technology and implementation parameters are identical to what is used in their report, except the results are scaled for a 0.5 µm CMOS technology instead of a 0.8 µm CMOS technology. When the access and cycle times of a tag-less table are estimated, the tag

side of their cache model is ignored and only the data side is considered. Select logic and

multiplexer delays were also estimated by applying the Horowitz approximation Wilson

and Jouppi used [18].

Table 5.11 displays the access time of direct-mapped caches, BIT table, PHT, ST,

NLS target array, and 4-way BTB target array for different sizes. The access time estimates

for the BTB are about 2 ns greater than the estimates for the NLS and are higher than

the access time for direct-mapped caches. The access time for the associative BTB is

greater because of the required tag matching. Consequently, in order to design a fetching

mechanism with a short cycle time, the NLS target array is preferred.

Table 5.11: Access Time Estimates (ns)

i-cache       BIT            PHT            ST (DS)        NLS           4-way BTB
size   time   entries time   entries time   entries time   entries time  entries time
8 KB   3.2      256   2.1      512   2.2      512   2.1       64   2.0      8    4.3
16 KB  3.8      512   2.1     1024   2.3     1024   2.2      128   2.2     16    4.4
32 KB  3.9     1024   2.3     2048   2.7     2048   2.6      256   2.3     32    4.5
64 KB  4.7     2048   2.7     4096   3.3     4096   3.1      512   2.5     64    4.6

Single Selection

Figure 5.18 is a timing chart for a direct-mapped 8 Kbyte instruction cache using dual

block prediction with single selection. The chart includes timing of a 256-entry NLS target


array, a 1024-entry select table, a 1024-entry blocked PHT, and a 256-entry BIT table. The

BIT table has enough entries to avoid any BIT mispredictions, since the instruction cache

has 256 rows with a block size of 32 bytes. All the structures use a block width of eight

and a single port, except the PHT and BTB, which are dual-ported.

[Timing chart: the instruction cache, NLS, BIT, PHT, and select-table accesses proceed in parallel; the select logic and the two target multiplexers then complete both target predictions within the cache cycle.]

Figure 5.18: Timing Chart for 8 KB Instruction Cache Using Dual Block Prediction with Single Selection

The access time of the cache is 3.2 ns, which is fast compared to a 4.7 ns access

time of a 64 Kbyte cache. The instruction cache requires a precharge time of about 1

ns, which makes the cycle time greater than the access time. This allows enough time


to discharge the word lines and precharge the bit lines. During this time, alignment of

instructions may take place as well as the prediction of the new PC addresses. The select

logic requires BIT and PHT block information. Hence, it may not begin computing until

both have completed reading. As shown in the chart, the PHT requires 2.3 ns to read its

data. The select logic, as shown in Figure 5.4, takes approximately 0.5 ns to complete.

After this time, the control logic is ready for the first multiplexer to select from the NLS

target array, RAS, or fall-through address (see Figure 5.5). Some of the inputs for the

second multiplexer are dependent on the output of the first multiplexer. As a result, an

additional 0.2 ns is required to complete the selection of the second target. Even with the

select logic, the prediction of both targets is completed with a comfortable margin of 0.7

ns. Only if the cycle time of the instruction cache significantly decreases or the access time

of the BIT or PHT increases would prediction using single selection become the critical

path. If this is the case, double selection may be the solution.

Double selection can avoid the extra delay required by selection logic. Unfortunately,

double selection always performs significantly worse than single selection. On the other

hand, if the selection logic required with single selection extends the cycle time of the

processor, the cycle-time savings from avoiding that logic may justify the extra penalty

cycles of double selection. As shown in Figure 5.18, the selection logic may become part

of the critical path if the access time of the instruction cache decreases. This may be

accomplished with a cache size less than or equal to 2 Kbytes. It is unlikely, however, that

a designer with the high transistor budgets today’s technology provides would implement

a small instruction cache and use PHT and STs larger than the primary cache itself.


Double Selection

The potential benefit from double selection can most likely be exploited using a pipelined instruction cache access that completes in two cycles, as used in the Intel

Pentium Pro [30]. For example, using a large primary instruction cache of 32 Kbytes,

the access time will increase to 3.9 ns and the cycle time will increase to 5.3 ns. The 25%

increase in cycle time compared to an 8 Kbyte cache may not be justified by the decrease in instruction cache miss penalties. On the other hand, the cycle time may be cut in half

if the instruction cache access spans two cycles. In order to retain the same instructions-fetched-per-cycle throughput, dual block prediction needs to complete in one cycle. Also,

four banks will be busy during one cycle instead of two, so this increases the possibility of

a bank conflict.

One possibility to keep bank conflicts under control is to use the sets of a set-associative instruction cache in addition to interleaving the cache based on address to provide

multiple banks. For example, sixteen banks may be chosen from four sets, where each set

is interleaved four ways based on the address. When using next line and set prediction, the

set is predicted before the access of the cache line. Therefore, it is unnecessary to access

all sets and select the correct one after a tag comparison. Consequently, only the predicted

set needs to be accessed, leaving the remaining sets for use by other blocks to be fetched at

the same time. When an instruction cache miss occurs, the line is placed into a set such that it does not conflict with the other lines being fetched when the miss was initiated. The next time the

line is accessed, it is likely that it will be accessed with the same lines as before. Hence,

the chance of a bank conflict is significantly reduced.


Using a shortened cycle time from a two-cycle instruction cache access and a single

cycle dual block prediction, the select logic from single selection becomes part of the critical path. For example, a 32 Kbyte cache access completes in two cycles, with each cycle

lasting 2.7 ns. Referring back to the timing chart of Figure 5.18, the dual block predic-

tion completes after 3.5 ns. As a result, using single selection with a pipelined instruction

cache would require increasing the cycle time from 2.7 ns to 3.5 ns to meet the objective

of dual block prediction in a single cycle. Alternatively, if double selection is used, the 0.5 ns

from the selection logic no longer exists. In addition, the multiplexer selection bits become

ready 0.2 ns earlier because the select table is faster than the pattern history table. As a result, the dual

block prediction can be completed within 2.8 ns, which comes close to the 2.7 ns goal. The

timing chart for this pipelined 32 Kbyte instruction cache using dual block prediction with

double selection is shown in Figure 5.19.

The timing chart of Figure 5.19 shows four instruction cache accesses: two are ini-

tiated during Cycle 0 and two are initiated during Cycle 1. The prediction for the lines

in Cycle 1 was completed in Cycle 0 by using the select table. Also shown in the chart

is the prediction for Cycle 2 made during Cycle 1. The dual block prediction determines

the cycle time of 2.8 ns. The access of the first two lines completes 1.1 ns into the second cycle. Approximately 1.7 ns is then available to perform instruction alignment and block

merging required by two-block fetching.

Indeed, double selection can be useful in reducing the cycle time when the prediction

is time-critical. As demonstrated by this example, double selection reduced the cycle time

by 20% over the cycle time required by single selection. An overall instructions per second

performance increase of about 10% is expected after considering a 10% loss in IPC_f.
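The cycle-time arithmetic behind this comparison can be checked directly. This is a small sketch using the nanosecond figures quoted in the text (32 KB pipelined cache example); the variable names are mine.

```python
# Timing figures from the text's 32 KB pipelined-cache example.
single_selection = 3.5   # ns: dual block prediction incl. final select logic
select_logic = 0.5       # ns: selection logic eliminated by double selection
st_vs_pht = 0.2          # ns: select table read is faster than the PHT read

double_selection = single_selection - select_logic - st_vs_pht
reduction = 1 - double_selection / single_selection
print(f"{double_selection:.1f} ns, {reduction:.0%} shorter")  # 2.8 ns, 20% shorter
```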


Figure 5.19: Timing Chart for Pipelined 32 KB Instruction Cache Using Dual Block Prediction with Double Selection


Chapter 6

Scalable Register File

A large branch execution penalty may result from register renaming using a mapping

table. Section 6.1 describes a hybrid renaming technique, which combines the advantages of a reorder buffer with those of a mapping table, to significantly reduce the recovery

time. Section 6.2 analyzes the utilization of a register file and discovers that most of the

physical registers are not being used most of the time. This leads into Section 6.3, which

describes dynamic result renaming. The implementation details of the hybrid renaming

mechanism and dynamic result renaming are described in Section 6.4. Lastly, Section 6.5

presents the performance of the scalable register file architecture.*

6.1 Register Renaming

Section 2.6 gave background information regarding register renaming. Registers can

be renamed from logical to physical registers using a mapping table. Unfortunately, a mispredicted conditional branch or an exception may result in a large penalty to recover the

mapping table. Section 6.1.2 introduces a hybrid renaming technique to reduce this penalty.

*Parts of this chapter were published in the proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques [46].


The hybrid renaming technique uses both content addressable memory (CAM), as used in

a reorder buffer, and random access memory (RAM), as used with a mapping table. The

number of ports for the CAM and RAM cells can be reduced by detecting the dependencies

within the decode block, as will be described in Section 6.1.3.

6.1.1 Recovery

A major performance penalty with using a mapping table and a recovery list is the

time it takes to recover from a mispredicted branch. Using the recovery list, the mapping

table can be recovered by undoing each entry in the list one at a time. The mapping table

can be updated in groups at a time, but the rate is limited by the number of read and write

ports. If P ports are available and M register mappings need to be recovered, then it will

take ⌈M/P⌉ cycles to recover.
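The recovery-cost formula above is a simple ceiling division; a minimal sketch:

```python
from math import ceil

def recovery_cycles(mappings: int, ports: int) -> int:
    """Cycles needed to undo M register mappings through P mapping-table ports."""
    return ceil(mappings / ports)

# e.g. undoing 24 mappings through 8 ports costs 3 cycles
print(recovery_cycles(24, 8))  # 3
```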

Hence, the longer it takes for a branch to be found incorrectly predicted, the longer it

takes to recover the mapping table. In fact, this essentially doubles the branch misprediction

penalty, compared to a mechanism that can recover from a mispredicted branch in one

cycle.

In addition, the number of ports on the mapping table's RAM cells grows in proportion to N, since more instructions are renamed per cycle. Therefore, it is not ideally

scalable. On the other hand, since the number of logical registers remains fixed, N would

have to be large in order to cause real problems from a practical standpoint. Nevertheless,

it would be beneficial to reduce the number of ports of the mapping table’s RAM cell.


6.1.2 CAM/Table Hybrid

The advantage of a register renaming mechanism that uses CAM, such as a reorder

buffer [19], is a one cycle recovery time from a mispredicted branch or exception. This

is accomplished by simply invalidating appropriate entries relative to the branch. A one

cycle recovery time is desirable. However, CAM cells scale worse than RAM cells, as

used in a mapping table. To begin with, a CAM cell is more expensive and slower than a

normal RAM cell. In addition, the lookup array, which searches for the most recent register

instance, grows as the number of speculative registers increases. Therefore, it is desirable

to have the significant performance benefits of the CAM and the area benefits of RAM.

As a compromise, CAM is used to rename a limited number of speculative regis-

ters (extra registers reserved for speculative results until committed), and a mapping table,

which is implemented using RAM, renames the remaining speculative registers and the

committed registers. Figure 6.1 is a block diagram of renaming hardware which uses both

CAM and RAM. The hardware which uses CAM is called CAM lookup. The CAM lookup

is a FIFO queue. When a new instance of a logical register is created, it is inserted into

the CAM lookup. Only instructions which have a result need to be entered into the CAM

lookup. When an instance exits the CAM lookup, it is entered into the appropriate map-

ping table entry, and the old physical register is entered into the recovery list. When an

instruction commits its result, the old physical register in the recovery list is freed.

Source operands are first searched for matching entries in the CAM lookup list for

the most recent destination register. If a match is not found, then the register is looked up in

the mapping table. When a mispredicted branch is resolved, if its instruction tag identifier

indicates that subsequent instructions are still in the CAM lookup entries, then there is no

additional penalty. If it has been entered into the mapping table, then the recovery list is


used to undo the mappings. However, the penalty is not as severe since instances are first

entered into the CAM lookup. The number of entries to recover is reduced by the number

of entries in the CAM lookup. Consequently, there is a significant reduction in the average

misprediction or exception penalty.
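The reduction in recovery work can be stated as a one-line relation. This sketch (names are mine) counts only the mappings that must be undone from the recovery list, since results still in the CAM lookup are invalidated in a single cycle:

```python
def mappings_to_undo(wrong_path_results: int, cam_entries: int) -> int:
    """Recovery work after a misprediction under hybrid renaming: only
    wrong-path results that have already migrated from the CAM lookup into
    the mapping table must be undone via the recovery list."""
    return max(0, wrong_path_results - cam_entries)

# 20 wrong-path results with a 16-entry CAM lookup: only 4 table recoveries
print(mappings_to_undo(20, 16))  # 4
print(mappings_to_undo(10, 16))  # 0: branch resolved while still in the CAM
```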

Figure 6.1: Block Diagram of Hybrid Renaming

To compare the performance benefit from a full mapping table, full CAM lookup, and

hybrid, Table 6.1 lists the average misprediction penalty and instructions per cycle (IPC)


for CAM depths (number of entries divided by decode width) of 0, 2, 4, and 8. The mispre-

diction penalty includes all the pipeline bubbles in the various stages. A branch may wait

to be issued for several cycles. These cycles are also included, except when the pipeline

stalls due to data dependencies. In addition, an extra cycle may be included if instructions

have to be re-fetched from the mispredicted branch’s block. A significant performance im-

provement is observed with the CAM lookup because the average misprediction penalty

is reduced. Beyond about half the total depth (CAM = 4), there is only a marginal improvement in performance compared to a full CAM lookup (CAM = 8). Therefore, the hybrid

CAM/table is a good compromise between cost and performance.

Table 6.1: Bad Branch Penalty and Performance

                        CAM=0         CAM=2         CAM=4         CAM=8
Arch   Suite  Issue  Penalty  IPC  Penalty  IPC  Penalty  IPC  Penalty  IPC
SDSP   Int    4        7.7   2.63    6.4   2.70    5.8   2.74    5.5   2.75
SPARC  Int    4        8.0   2.20    6.6   2.29    6.0   2.32    5.7   2.35
SPARC  FP     4        8.1   1.58    6.8   1.61    6.1   1.63    5.8   1.64
SDSP   Int    8        8.8   3.70    7.4   3.85    6.7   3.93    6.2   3.99
SPARC  Int    8       10.0   2.70    8.4   2.84    7.5   2.91    7.0   2.96
SPARC  FP     8       10.6   1.91    9.1   1.96    8.1   2.00    7.4   2.03

6.1.3 Intrablock Decoding

Many operands are dependent on a result in the same block or in the recent past.

This is shown in Figure 6.2. It shows the number of instructions between a source operand

and the creation of the register it is referencing. In a block of four instructions, each with

one operand (since about half are usually constant or not used), about 1.2 operands are


expected to be dependent within that block. If intrablock dependencies are detected, the

number of CAM ports required to search the lookup entries may be reduced. If there is not

enough time in the decode stage to determine the intrablock dependencies, then pre-decode

bits in the instruction block can be used. Each source operand in a block requires log2 N bits to encode which of the previous N − 1 instructions it depends on, or none at all.

In addition, if an operand is dependent on an instruction which comes before the starting

position of a block, the dependency information is ignored. When each line is brought into

the instruction cache, or after the first access, the line containing a block of instructions

is annotated with 2N log2 N bits (for two source operands) indicating if an instruction is

dependent on another one in the same block.
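The annotation cost above follows directly from the encoding. A minimal sketch (the function name is mine):

```python
from math import ceil, log2

def annotation_bits(n: int) -> int:
    """Pre-decode annotation per block of N instructions: each of the two
    source operands of each instruction needs log2(N) bits to name one of
    the N-1 earlier instructions it depends on, or 'none'."""
    return 2 * n * ceil(log2(n))

print(annotation_bits(4), annotation_bits(8))  # 16 48
```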

Figure 6.2: Dependence Distance for SDSP/SPARC (cumulative percent of source operands whose producing instruction is within the given distance):

Distance:   0     1      2      4      8      16     32     64
SDSP:       0.0%  46.3%  56.6%  66.6%  75.3%  84.9%  90.6%  93.3%
SPARC:      0.0%  27.3%  40.9%  53.1%  66.8%  76.2%  82.6%  86.9%

Simulations with intrablock detection show that decoding four instructions requires one fewer CAM port to search the lookup array for equivalent performance. Instead of


needing five or six CAM ports, now the CAM lookup can use four or five. When the block

size is doubled to eight instructions, about four out of eight register operands are expected

to be intrablock dependent. Therefore, instead of doubling the number of CAM ports when

the decode size doubles, an increase of only one port is needed – to about five or six.

Referring back to Figure 6.1, the intrablock decoding now can be done before any

CAM lookups or table mappings. As N increases, the number of operands needed for CAM

and table lookup only increases slightly because it becomes more likely that operands will

be dependent within the same block. Hence, the hybrid CAM/table renaming scheme is

scalable.

6.2 Register File Utilization

A disadvantage with allocating a physical register at decode time is that physical reg-

isters go unused until they receive their result value. As a result, a good portion of the

register file (RF) is wasted most of the time. The total register file utilization is defined to

be the ratio of the number of physical registers holding a useful value to the total number of physical registers. In addition, the speculative register file utilization is the ratio of the number of physical registers holding a useful speculative value to the total number of physical registers reserved for speculative results (committed registers are not included). A value

is considered to be useful if it is needed to ensure proper execution of the machine. With

speculative execution and precise interrupts, this occurs from the time a register receives

its result until it is committed.
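The per-cycle statistics reported below (mean, median, percentile) could be gathered from a simulator that logs the count of useful speculative registers each cycle. This is an illustrative sketch; the sample values are invented, not measured data from the dissertation:

```python
# Invented per-cycle counts of useful speculative registers, for illustration.
samples = sorted([2, 4, 4, 5, 6, 8, 9, 12, 15, 21])

def nearest_rank(sorted_vals, pct):
    """Nearest-rank percentile of an already-sorted sample."""
    idx = max(0, round(pct / 100 * len(sorted_vals)) - 1)
    return sorted_vals[idx]

mean = sum(samples) / len(samples)
print(mean, nearest_rank(samples, 50), nearest_rank(samples, 90))  # 8.6 6 15
```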

Table 6.2 shows the average speculative and total register file utilization per cycle

for 4-way and 8-way superscalar processors. The mean, median, and 90th percentile of


Table 6.2: Average Register File Utilization per Cycle

Arch   Suite  Issue  % spec  % total   mean  median  90th %
SDSP   Int    4       26.3    63.2     8.43     8      17
SPARC  Int    4       16.0    84.0     5.11     4      12
SPARC  FP     4        6.3    53.2     2.03     1       6
SDSP   Int    8       24.6    49.7    15.73    15      32
SPARC  Int    8       12.4    72.0     7.91     5      21
SPARC  FP     8        4.8    54.4     2.80     1       9

the number of useful speculative registers are shown. Physical registers used to store the

state of the logical register file will always be active, so the total register file utilization is

not as meaningful as the speculative register file utilization. From the results presented,

it is observed that less than one quarter of the registers reserved for speculative execution

are used on the average. Less than half of the available speculative registers are used 90%

of the time. The floating point RF used in the SPARC SPECfp95 has an extremely low

utilization: less than 6%. Hence, the majority of speculative registers are going to waste

most of the time.

6.3 Dynamic Result Renaming

As has been shown, many physical registers in the register file have no value or

contain a useless value. Therefore, one way to reduce the size of the RF is to improve

its utilization. Physical registers allocated with no value can be virtually eliminated by

allocating at result write time instead of decode time. This is accomplished by splitting

the register file into multiple banks, each bank with two read ports and one write port, as


shown in Figure 6.3. Each bank maintains its own free list (see Figure 6.1), and old physical

registers are freed when an instruction commits. In addition, a bank is directly connected

to one result bus. When functional units arbitrate for result buses, a free register is needed

in each bank. The allocation of a register cannot be done at decode time, since it is not

known exactly on which functional unit and bus a result will eventually arrive. On the other

hand, by allocating the entry when results are written, multiple banks can be used with one

write port and have no conflicts with writing results into the same bank. As a result of

allocating physical registers at result write time, the size of each bank can remain constant

as the number of banks increases proportionally with the issue width.

Although allocating physical registers at write time creates no conflicts for the single

write port in a bank, the two read ports on one bank can cause contention with the instruc-

tion window. For example, three ALUs could require three operands from a single bank.

With only two read ports, one ALU would not be able to issue its instruction. Even though

this event can happen, it is not a likely event for two reasons. First, not every instruction

issued requires two register operands. Some have one operand, while others require an immediate value. Second, most instructions issued bypass one of their operands from the result of an instruction completed in the previous cycle. Consequently, such a limited number of

read ports per bank has a very limited impact on performance.

Table 6.3 demonstrates this fact by showing the distribution of read operand types,

and the percentage of individual operand requests failed due to insufficient ports. If the

operand is a register, then it can originate from the first or second level of bypassing, the first

or second read port of a bank, or be an identical register read from the first or second read

port. On the other hand, if it is not a register operand, then it can be an immediate value, a

zero value, or no operand at all. Interestingly, although a significant percentage of register

operands came from the first read port, few required the second read port. Furthermore, a


Figure 6.3: Block Diagram of Scalable Register File


large percentage of registers are bypassed, especially at the first level. Since many operands

are bypassed, a traditional RF using map-on-decode renaming could reduce the number of read ports

by about 50%. The number of write ports, however, must remain the same. Consequently,

although the size of its RF may be reduced, this does not lead to a scalable solution.

The allocation of registers for results is pipelined. Two cycles before the result will

be ready for writing, write arbitration takes place. The free list of each bank is searched

for an available register. Result buses are assigned to a bank with a free register in a round-

robin process. For example, consider a register file with four banks. In one cycle, three

results are assigned to the first three banks. In the next cycle, the first result bus is assigned

to the fourth bank, and remaining results are assigned starting with the first bank. If a bank

does not have a free register, that bank is skipped in the assignment process.
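The round-robin policy just described can be sketched as follows; the function and variable names are mine, and the free-register counts are illustrative:

```python
def assign_result_buses(num_results, free, start):
    """Round-robin assignment of result buses to register-file banks.
    `free` holds the free-register count per bank; banks with no free
    register are skipped, and `start` rotates the assignment across cycles."""
    banks = len(free)
    assignment, bank = [], start
    for _ in range(num_results):
        for probe in range(banks):
            b = (bank + probe) % banks
            if free[b] > 0:
                free[b] -= 1
                assignment.append(b)
                bank = (b + 1) % banks
                break
        else:
            assignment.append(None)  # no bank has a free register: stall
    return assignment, bank

# Cycle 0: three results fill banks 0..2; cycle 1 wraps to bank 3, then 0, 1
a0, nxt = assign_result_buses(3, [4, 4, 4, 4], 0)
a1, _ = assign_result_buses(3, [3, 3, 3, 4], nxt)
print(a0, a1)  # [0, 1, 2] [3, 0, 1]
```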

If there are more results than available banks, then the pipeline for the functional unit which cannot write its results is stalled. Also, another instruction may be issued

to the functional unit before the instruction window is notified that the functional unit is

stalled. Consequently, two instructions with results may be waiting in the functional unit’s

pipeline. This does not create a problem since already two levels of bypassing exist. If

subsequent instructions require a stalled result, then the result continues to use the bypass

network until it is written to the register file.

In order to be able to allocate registers at result write time and be able to do register

renaming for out-of-order execution, two types of renaming must take place. Before an

instruction is placed into the instruction window, its destination operand is renamed to a

unique instruction tag (itag) and inserted into the mapping table. This contrasts with renaming

the register to a physical register since allocation has not taken place yet. After the physical

register has been allocated for the result, then the instruction window, mapping table, and


Table 6.3: Read Operand Category Distribution (%)

4 Issue 8 Issue

Architecture SDSP SPARC SDSP SPARC

SPEC95 Int Int FP Int Int FP

Bypass Level 1 25.1 19.7 19.4 19.7 13.8 21.5

Bypass Level 2 4.8 4.3 2.6 4.3 2.9 2.3

Read Port 1 13.2 18.1 20.4 18.1 9.6 18.8

Read Port 2 0.8 1.5 5.2 1.5 1.1 3.4

Identical 0.3 0.6 1.6 0.6 1.0 2.5

Zero Value 9.3 9.3 8.7 9.3 13.8 8.7

Imm. Value 32.7 32.7 38.8 32.7 54.5 38.9

No Operand 13.8 13.8 3.3 13.8 3.5 4.0

Failed Read 0.2 0.2 7.3 0.2 7.4 5.1


recovery list need to be notified of the physical register (preg). Using the result’s destination

register identifier, the mapping table is updated, if necessary. Matching itag entries in the

instruction window receive the preg. Entries in the instruction window are already matched

to mark their operands ready, so there is little additional cost involved beyond storage.

A problem exists with updating the preg in the recovery list. The recovery list may

contain an entry with an invalid old preg because its value is not ready. As a result, there

needs to be some way of finding that entry so it can be updated. The old preg in the recovery

list can be updated by matching the itag using CAM cells. A more efficient mechanism would

be to store the next itag in the recovery list. When the entry in the recovery list is marked

complete, the next itag is read, and the preg is written into the entry indexed by the next

itag.
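The next-itag linking can be sketched with a small data structure. This is an illustrative model, not the hardware implementation; field and method names are mine:

```python
class RecoveryList:
    """Next-itag scheme: entry[i].old_preg holds the physical register of
    the previous instance of instruction i's logical register. With
    result-time allocation that preg may not be known yet, so the previous
    instance records which entry to forward its preg to once allocated."""
    def __init__(self, size):
        self.old_preg = [None] * size
        self.next_itag = [None] * size

    def record_rename(self, itag, prev_itag):
        # At decode: the previous instance learns who overwrites its register.
        if prev_itag is not None:
            self.next_itag[prev_itag] = itag

    def result_written(self, itag, preg):
        # At write time: forward the newly allocated preg to the entry that
        # recorded this instruction as its predecessor.
        nxt = self.next_itag[itag]
        if nxt is not None:
            self.old_preg[nxt] = preg

rl = RecoveryList(8)
rl.record_rename(itag=5, prev_itag=3)  # instr 5 overwrites instr 3's register
rl.result_written(itag=3, preg=17)     # instr 3's result lands in preg 17
print(rl.old_preg[5])  # 17
```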

The greatest hardware cost of dynamic result renaming is the full multiplexer network used for reading and writing registers. This cost, however, does not begin to compare to the enormous time and space savings from using a two-read-port, one-write-port register file. The bypass network is still required in the map-on-decode case as well. In addition, the bypass network might be reduced by implementing suggestions from Ahuja et al. [2].

6.3.1 Deadlocks

The most critical aspect of using dynamic result renaming is avoiding deadlock sit-

uations. If there are fewer speculative registers than entries in the recovery list, then it

is possible for all registers to be allocated while results are still pending, creating a deadlock

situation. To guarantee a deadlock will not occur, two conditions must exist:

1. The oldest instruction must be able to issue.


2. The oldest instruction must be able to write its result.

The oldest instruction may be unable to issue if the functional unit it requires is stalled and no free registers are available. Therefore, if this situation arises, the oldest instruction, and only the oldest instruction, is permitted to issue to a functional unit whose results are stalled and latched into its two pipeline stages.

When it completes, the result is delivered to the register file and written using the old phys-

ical register stored in the recovery list (thereby guaranteeing the second condition).

Moreover, if the oldest instruction is unable to issue or complete its instruction, then

the processor temporarily executes in a scalar manner. In addition, each functional unit

must have access to enough banks to cover at least as many registers as the number of entries in the recovery list plus the number of logical registers (in most situations, the entire register file).

6.4 Implementation

This section describes implementation details for hybrid renaming with dynamic result renaming. First, the structures and procedures used for renaming source operands are explained, followed by those used for renaming destination operands at decode and result write time.

6.4.1 Source Operand Renaming

Register renaming can be performed by using a mapping table. A CAM lookup can

also be used to perform hybrid renaming. In addition, intrablock detection of dependent


operands can be used. This section describes some of the implementation requirements for

a CAM lookup and a mapping table. The steps involved in renaming source operands and destination operands are then described.

CAM Lookup

The CAM lookup is composed of a destination register identifier, an instruction tag

field, a physical register field, and a ready field. The name, number of bits required, type of

cell, and description of each field for the CAM lookup is given in Table 6.4. The number

of logical registers is represented by L; the number of speculative instructions allowed is

represented by S; and the number of physical registers is represented by R. For L � ��,

S � ��, and R � �, 5 bits are required for the destination register, 6 bits are required for

the instruction tag, and 7 bits are required for the physical register number.

Table 6.4: CAM Lookup Fields

Name   Bits        Type  Description
Reg    log2 L      CAM   Logical destination register identifier
Itag   log2 S      RAM   Unique instruction tag
PReg   ⌈log2 R⌉    RAM   Indicates the physical register number
Ready  1           RAM   Indicates if PReg field is valid (result has completed)
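The field widths follow mechanically from L, S, and R. A small sketch, assuming the example sizes L = 32, S = 64, and R = 96 (values consistent with the 5-, 6-, and 7-bit widths quoted in the text):

```python
from math import ceil, log2

def field_widths(L, S, R):
    """Bits for the Reg, Itag, and PReg fields of the CAM lookup:
    L logical registers, S speculative instructions, R physical registers."""
    return ceil(log2(L)), ceil(log2(S)), ceil(log2(R))

print(field_widths(32, 64, 96))  # (5, 6, 7)
```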

Mapping Table

The mapping table is composed of an instruction tag field, a physical register field,

and a ready field. The name, number of bits required, and description of each field for the


mapping table are given in Table 6.5. These fields are the same as in the CAM lookup, except that the logical register indexes into the table instead of being stored in CAM.

Table 6.5: Mapping Table Fields

Name   Bits        Description
Itag   log2 S      Unique instruction tag
PReg   ⌈log2 R⌉    Physical register number
Ready  1           Indicates if PReg field is valid (result has completed)

Renaming Procedure

The steps performed in renaming a source operand are summarized as follows (refer

to Figure 6.1):

1. Intrablock Detection: Pre-decode bits indicate which instruction within the block, if any, an operand depends on. The instruction tag of that instruction is returned to the instruction window.

2. CAM Lookup: Remaining operands from intrablock detection are searched for the

most recent matching destination register identifier using content addressable mem-

ory. If the register has its value ready, then the physical register is used for the

operand. Otherwise, the instruction tag is used for the operand.

3. Mapping table: Remaining operands from CAM lookup are indexed into the map-

ping table. If the register has its value ready, the physical register is returned. Other-

wise, the instruction tag is returned to the instruction window.


Intrablock detection is optional; its purpose is to reduce the number of ports required for the CAM lookup. Likewise, the CAM lookup itself is not required for renaming operands; the mapping table can be used alone.
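The three-step source-rename cascade can be sketched as a priority chain. This is an illustrative model (names and the dictionary representation are mine); each structure yields either a ready physical register or an instruction tag for the operand to wait on:

```python
def rename_source(reg, intrablock, cam, table):
    """Rename one source operand: try intrablock detection first, then the
    CAM lookup (newest instance wins), then the mapping table. Each stage
    maps a logical register to ('preg', n) if the value is ready, or
    ('itag', n) if the operand must wait in the instruction window."""
    for lookup in (intrablock, cam, table):
        hit = lookup.get(reg)
        if hit is not None:
            return hit
    return None  # cannot happen: the table maps every logical register

cam = {3: ('itag', 41)}                   # r3's newest instance, result pending
table = {3: ('preg', 12), 7: ('preg', 5)}
print(rename_source(3, {}, cam, table))   # ('itag', 41): CAM wins over table
print(rename_source(7, {}, cam, table))   # ('preg', 5)
```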

Instead of accessing the CAM lookup and then the mapping table in series, they may be accessed in parallel. This requires 2N ports for the mapping table's RAM cells, which

would satisfy the worst case possible. A serial access, on the other hand, allows the number

of ports for the mapping table’s RAM cells to be significantly reduced. However, accessing

the CAM lookup and mapping table sequentially may not be possible in one cycle. Most

likely another cycle will be required to read the mapping table for the remaining operands.

The instruction window can tolerate an additional cycle delay in receiving the remaining

renamed operands with no observable impact on performance. This is possible because

the majority of instructions are waiting for a counterpart operand to become ready before being issued.

6.4.2 Destination Operand Renaming

This section describes the procedure used in renaming a destination operand. The

steps for dynamic result renaming are also listed, and the structures used for the recovery list and the free list are detailed.

Recovery List

When a destination register is renamed and entered into the mapping table, the old

mapping is recorded into the recovery list. Table 6.6 lists the name, number of bits, and

description for its three fields: old physical register, next instruction tag, and completed bit.


Table 6.6: Recovery List Fields

Name       Bits        Description
Old PReg   ⌈log2 R⌉    Physical register number of the previous instance of the same logical register
Next Itag  log2 S      Instruction tag of the next instance of the same logical register
Completed  1           Indicates if the instruction has completed

Operand Events

The sequence of events a destination operand experiences from decode time until it

commits its result are summarized as follows:

1. Rename the destination logical register identifier to a unique instruction tag. Instruc-

tion tags are numbered sequentially.

2. Enter instruction tag and destination register identifier into CAM lookup.

3. Destination register exits CAM lookup and is entered into the mapping table. Its

logical register identifier is used to index into the mapping table. The entry is read

and used in the next two steps. The new instruction tag, physical register number,

and ready bit are written into that entry.

4. The old physical register from the mapping table entry is written into the recovery

list.

5. The old instruction tag is used to index the recovery list, and the new instruction tag

is written into that entry's Next Itag field.

6. A physical register is allocated when the result is ready. The physical register is writ-

ten to the appropriate entry in either the CAM lookup or the mapping table (explained

in the following section).


7. When the result commits, the old physical register number is read from the recovery

list entry and freed.
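The seven steps above can be sketched in software as follows. The class and field names (`Renamer`, `RecoveryEntry`, and so on) are illustrative assumptions, not structures from the dissertation, and physical register allocation at result time (step 6) is deferred to the sketch in the next section.

```python
# Illustrative sketch of destination-operand renaming with a recovery list.
# All names here are assumptions for exposition, not hardware structures
# described in the dissertation.

class RecoveryEntry:
    def __init__(self, old_preg):
        self.old_preg = old_preg   # previous physical register of the same logical reg
        self.next_itag = None      # instruction tag of the next instance of this reg
        self.completed = False     # set when the result is written

class Renamer:
    def __init__(self, num_logical):
        # mapping table: logical reg -> (instruction tag, physical reg, ready bit)
        self.map = {r: (None, r, True) for r in range(num_logical)}
        self.recovery = {}         # instruction tag -> RecoveryEntry
        self.next_itag = 0

    def rename_dest(self, lreg):
        itag = self.next_itag                   # step 1: tags numbered sequentially
        self.next_itag += 1
        old_itag, old_preg, _ = self.map[lreg]  # step 3: read the old mapping entry
        self.map[lreg] = (itag, None, False)    # new mapping; result not yet ready
        self.recovery[itag] = RecoveryEntry(old_preg)   # step 4: record old mapping
        if old_itag is not None:                # step 5: link the previous instance
            self.recovery[old_itag].next_itag = itag
        return itag

    def commit(self, itag):
        # step 7: at commit, read and free the old physical register
        return self.recovery.pop(itag).old_preg
```

Renaming the same logical register twice links the two instances through the Next Itag field, and committing the first instance frees its old physical register.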

Dynamic Result Renaming

In order to perform dynamic result renaming to multiple banks of a register file, a

free list must be maintained for each bank. Only a single bit is required to indicate whether

a register is allocated, so the storage required is small. For example, a register file with 96

registers divided into 8 banks requires a 12-bit free list per bank. When

results are committed, the old physical registers are read from the recovery list. A mask

can be generated for each bank according to the registers to be freed. Then the old register

mask for each bank is ORed with the corresponding free list.
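The per-bank free lists described above can be sketched as bit vectors; the function names and 8-bank, 12-registers-per-bank sizing follow the example in the text, but the code itself is an illustrative assumption.

```python
# Sketch of per-bank free lists kept as bit vectors. A 96-register file
# split into 8 banks needs a 12-bit free list per bank, as in the text.

NUM_BANKS = 8
REGS_PER_BANK = 12

# one bit per register: 1 = free, 0 = allocated
free_list = [(1 << REGS_PER_BANK) - 1 for _ in range(NUM_BANKS)]

def allocate(bank):
    """Return the lowest-numbered free register in a bank, or None if full."""
    bits = free_list[bank]
    if bits == 0:
        return None
    reg = (bits & -bits).bit_length() - 1   # index of the lowest set bit
    free_list[bank] &= ~(1 << reg)          # mark it allocated
    return reg

def free_committed(old_regs):
    """Free old physical registers read from the recovery list at commit.

    Builds a mask per bank from the (bank, register) pairs, then ORs each
    mask into the corresponding free list, as described in the text."""
    masks = [0] * NUM_BANKS
    for bank, reg in old_regs:
        masks[bank] |= 1 << reg
    for bank in range(NUM_BANKS):
        free_list[bank] |= masks[bank]
```

The OR-based freeing means any number of registers in a bank can be released in a single operation per bank, which is what keeps the commit path simple.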

The steps involved in renaming a destination register during result write time are

summarized as follows:

1. Two cycles before the result is ready, registers are allocated for all banks by searching

the appropriate free list.

2. If the allocation fails for the bank to which the result is writing, the write must stall.

Otherwise, the result writes to the assigned bank as scheduled.

3. The instruction tag of the result is used to determine if the destination operand has

moved from the CAM lookup to the mapping table.

4. The allocated physical register is written into the appropriate entry in either the CAM

lookup or the mapping table, depending on that determination. The corresponding

ready bit is also set.


5. If the destination operand has been written to the mapping table, a corresponding

entry in the recovery list exists. The instruction tag of the result is used to read that

entry and mark the result as completed. The entry also provides the next instruction

tag for that register.

6. The allocated physical register number is written to the recovery list entry pointed to

by the next instruction tag, provided that tag is valid.
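The six result-write steps above can be sketched as follows. The data structures and field names are assumptions for exposition; in particular, the CAM lookup and mapping table are modeled as dictionaries keyed by instruction tag, a simplification of the hardware described in the text.

```python
# Illustrative sketch of dynamic result renaming at result-write time.
# Not the dissertation's implementation: structures are simplified dicts.

def write_result(itag, bank, allocate, cam, mapping, recovery):
    """allocate(bank) returns a physical register number or None (step 1)."""
    preg = allocate(bank)
    if preg is None:
        return "stall"                  # step 2: allocation failed, stall the write
    # Step 3: determine whether the destination has moved from the CAM
    # lookup into the mapping table; step 4: write preg and set the ready bit.
    entry = mapping[itag] if itag in mapping else cam[itag]
    entry["preg"] = preg
    entry["ready"] = True
    if itag in mapping:
        # Step 5: mark the result completed in its recovery list entry.
        rec = recovery[itag]
        rec["completed"] = True
        # Step 6: write the register into the next instance's recovery
        # entry, provided that tag is valid.
        nxt = rec["next_itag"]
        if nxt is not None:
            recovery[nxt]["old_preg"] = preg
    return preg
```

A failed allocation simply returns a stall indication, matching step 2; everything else proceeds as scheduled.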

6.5 Performance

This section compares the performance of dynamic result renaming with renaming

during decoding. The performance is first compared using the instructions per cycle (IPC)

metric and then using the billions of instructions per second (BIPS) metric.

To begin with, the overall performance is compared using the IPC metric. The num-

ber of speculative physical registers is varied. Figure 6.4 and Figure 6.5 show the IPC

for a 4-way and 8-way issue processor, respectively. Mapping on decode (referred to as the

base case) does not vary with the number of speculative physical registers, because its

recovery list is fixed at 32 (or 64) entries. Result renaming, on the other hand, is affected

by having more or fewer registers than this amount.

The performance difference from the base case is negligible, except for SPECint95

on the SPARC architecture, where a 5% decrease is observed. Increasing the number of

physical registers reduced this gap for the 8-way issue processor but did not help much for

the 4-way issue processor.


[Figure: IPC versus the number of speculative physical registers (8–48) for SDSP Int, SPARC Int, and SPARC FP, each compared against its base case.]

Figure 6.4: Register File Performance Comparison for a 4-way Superscalar


[Figure: IPC versus the number of speculative physical registers (16–96) for SDSP Int, SPARC Int, and SPARC FP, each compared against its base case.]

Figure 6.5: Register File Performance Comparison for an 8-way Superscalar


The reason why a slight performance decrease is observed in the SPARC architecture

and not in the SDSP architecture is the difference in the number of logical integer registers.

The ratio of logical to speculative registers in the SDSP is 1:1, while it is 4.25:1 for the

SPARC (136 integer registers) on a 4-way issue processor. This ratio is cut in half when the

decode width and recovery list are doubled. With a large ratio, speculative registers have

a difficult time competing against logical registers in a bank. Logical registers can pool

in one particular bank, restricting its usage; results then tend to be allocated from the bank

holding most of the free registers, and functional units stall because writing becomes

limited. This performance problem can be avoided by reducing the ratio

and/or increasing the number of banks.

With no or only a small performance decrease, the number of speculative registers can

be reduced by 25% to 50%. As a result, the total and speculative utilization of the register

file increases. Sometimes performance actually increases when the number of registers is

decreased. This is due not to the renaming itself, but to a slight reduction in the branch

misprediction penalty.

In Section 2.7, it was pointed out that the complexity of the register file increases its

cycle time, which in turn may force the cycle time of the processor to increase. By

assuming the cycle time of the processor is equal to the cycle time of the register file,

the performance can be compared more accurately. The cycle time of the

register file was estimated using the timing model of Wilson and Jouppi [51] as described

in Section 5.5.2. The register file is similar to a direct-mapped cache without a tag. The

number of ports in the register file was taken into consideration by appropriately increasing

the capacitance of the word and bit lines.


Figure 6.6 and Figure 6.7 show the billions of instructions per second (BIPS) and the

cycle time for a 4-way and 8-way processor, respectively. The BIPS metric reflects the

cycle time, since it is the IPC divided by the cycle time (equivalently, the product of the

IPC and the clock frequency). As with Figure 6.4 and Figure 6.5, the number of speculative registers is varied and

the performance is compared to the base case. The performance of a scalable register file is

significantly higher than that of the base case for both the SDSP and SPARC.

The increase in BIPS is approximately 25%, which can be attributed entirely to the

reduction in the cycle time of the register file. In fact, reducing the number of speculative

registers further increases performance, because the slight loss in IPC is far outweighed

by the improvement in cycle time. The reduction in cycle time

for the SDSP is greater than the SPARC because the ratio of logical to speculative registers

is much higher for SPARC. As a result, the percentage reduction of registers is much less

for the SPARC register file.
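As a concrete check of the metric, BIPS follows directly from IPC and a cycle time expressed in nanoseconds; the numbers below are illustrative, not values taken from the figures.

```python
# BIPS = IPC / cycle time in ns: completing 2.63 instructions per cycle with
# a 4.0 ns register-file-limited cycle gives 2.63 / 4.0 = 0.6575 billion
# instructions per second.

def bips(ipc, cycle_time_ns):
    # cycle_time_ns is in nanoseconds, so ipc / cycle_time_ns is already in
    # units of 10^9 instructions per second (BIPS).
    return ipc / cycle_time_ns

# A ~25% shorter cycle at nearly the same IPC yields roughly 25% more BIPS:
base = bips(2.63, 4.0)      # 0.6575 BIPS
scaled = bips(2.59, 3.0)    # about 0.863 BIPS
```

This makes clear why a small IPC loss is more than recovered by a cycle-time reduction of similar magnitude.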


[Figure: BIPS and register file cycle time (ns) versus the number of speculative physical registers (8–48) for SDSP and SPARC, each compared against its base case.]

Figure 6.6: BIPS and Cycle Time Performance Comparison for a 4-way Superscalar


[Figure: BIPS and register file cycle time (ns) versus the number of speculative physical registers (8–48) for SDSP and SPARC, each compared against its base case.]

Figure 6.7: BIPS and Cycle Time Performance Comparison for an 8-way Superscalar


Chapter 7

Conclusion

The research in this dissertation related to instruction fetching was motivated by work

done in my Master's thesis [42]. That work studied different performance aspects of a superscalar

microprocessor and found that instruction fetching significantly hindered overall perfor-

mance. It seemed logical to investigate further to find the cause and the solution.

Interestingly, research into high-performance instruction fetching mechanisms was

neglected until recently, for several reasons. First, single- and double-issue processors

were adequately supplied by a simple fetching mechanism. Even with a four-issue

processor, delays from the execution stage reduce the demand on the instruction fetcher,

so the deficiency in the instruction fetch mechanism was not yet evident.

Furthermore, branch prediction was the primary focus, because penalties from mispredic-

tion overshadowed any other fetching loss (instruction cache misses still have a significant

impact). Today, a two-level adaptive branch predictor provides excellent branch prediction,

a next line and set predictor produces good instruction fetch prediction, and instruction

cache hit rates are higher from larger primary caches. As a result, the instruction fetcher of

wide-issue processors with dynamic scheduling fails to supply instructions at an adequate

rate. The performance loss directly related to the instruction fetching mechanism becomes

significant.


It is important to first understand the specific problems and limitations of a simple

fetching mechanism. To begin with, a control transfer can jump into the middle of a cache

line, thereby reducing the number of potential instructions. This effect can be mitigated

by using an extended cache line and completely eliminated using a self-aligned cache.

Furthermore, a control transfer disrupts the sequential accessing of instructions, thereby

limiting the number of instructions that can be returned in a single cache line. This fact

places an upper bound on instruction fetching performance for a single block.

To reach the upper bound on instruction fetching, the instruction fetcher must read

more instructions from the instruction cache than the decoder requires. This allows re-

maining instructions to be buffered for later use when a control transfer produces a short

instruction run. Chapter 4 showed prefetching can reach the upper bound in theory, but

at an extreme cost in hardware. Nevertheless, using reasonable hardware, prefetching sig-

nificantly improves fetching performance. To go beyond the limit created by single block

fetching, at least two blocks of instructions must be fetched per cycle. The results prove

that two-block fetching dramatically improves instruction fetching performance.

In order to clearly identify the performance capability of a fetching mechanism, math-

ematical models were presented in Chapter 4. Given the design parameters and the proba-

bility of a control transfer, the expected instruction fetching performance can be calculated.

The models enable the production of graphs that clearly show the relationship between dif-

ferent fetching options without running hundreds of simulations. Also, they can be helpful

in the design of a new superscalar microprocessor to determine which technique will meet

its performance objective. In addition, the maximum performance of a specific fetching

mechanism can be evaluated for unlimited hardware resources.


Multiple conditional branches must be predicted in a single block if the potential

performance of the fetching mechanisms in Chapter 4 is to be realized. A scalable

mechanism to predict multiple branches in a block was presented in Chapter 5. It uses a

blocked PHT, which is able to retain the accuracy of a scalar predictor. Furthermore, mul-

tiple blocks must be accurately predicted per cycle to reach the performance potential of

two-block fetching. This is accomplished by predicting the prediction. A select table is

used to retrieve the previous prediction. As a result, two blocks can be accurately predicted

in parallel. The performance increase from two-block fetching dramatically outweighs

prediction penalties.

Dual block prediction can be performed using single selection or double selection.

Single selection always outperforms double selection. On the other hand, double selection

can provide some cost savings, since it does not require a BIT table. The most significant

benefit of double selection is that the selection bits are retrieved quickly rather than

computed. In most instances, the prediction using single selection will complete with a

comfortable timing margin. Double selection, though, can be useful in reducing the cycle

time when the prediction is time-critical. For example, when an instruction cache access is

pipelined and completes in two cycles, double selection may yield a lower cycle time than

single selection. This increase in processor speed can outweigh the performance loss of

double selection.

The results in Chapter 5 demonstrate that the instruction fetch mechanism with mul-

tiple block prediction and prefetching can sustain an adequate instruction fetching rate,

but branch mispredictions restrict the effective instruction fetching rate and overall perfor-

mance [23]. Unless the total penalty from branch prediction is correspondingly reduced

with the number of fetch cycles, it is impossible to achieve linear speedup. With the same

branch prediction accuracy of a scalar prediction, at best the number of penalty cycles will


be identical to scalar branch prediction penalties. Usually, though, the number of penalty

cycles increases due to the longer pipelines of wider-issue processors. Consequently, it is

imperative that the branch penalty not increase and, if possible, be reduced.

The last problem this dissertation addressed was the scalability of the register file.

After designing a reorder buffer and an instruction window, I recognized the implementation

implications of the register file for a wide-issue superscalar microprocessor. Each additional

instruction that is issued per cycle requires two more read ports and one more write port.

Since the area is proportional to the square of the number of ports, and the cycle time

increases with more ports, a large register file can significantly increase both the cycle

time and the area.

The MIPS R10000 uses a powerful register renaming technique. This technique re-

names registers from a logical register to a physical register when instructions are decoded.

Chapter 6 introduced a new technique which renames registers during result writing instead

of instruction decoding. As a result, multiple scalar register files can be used. The cost of

the register file now scales with the number of instructions issued per cycle. Furthermore,

the register file utilization can increase by reducing the number of physical registers. On

the other hand, renaming at result write time complicates register reads and introduces

deadlock conditions that must be avoided, but simulation results show the proposed

technique has no significant performance drawbacks compared to mapping during

instruction decoding.

Another benefit of using a scalable register file architecture is that the cycle time can

be close to that of a scalar register file. Consequently, a tremendous increase in instructions


executed per second is observed when the cycle time of the register file is taken into ac-

count. In addition, the performance can continue to increase by decreasing the number of

physical registers.

This dissertation described, modeled, and simulated different scalable instruction

fetching mechanisms. In order to increase fetching beyond the limit of single block fetch-

ing, fetching mechanisms were proposed which perform multiple branch and block predic-

tion. In addition, a scalable register file architecture was presented. All of these designs

strive to be scalable both in cost and performance.


Chapter 8

Future Directions

High-performance instruction fetching mechanisms were presented in this disser-

tation. Still, what changes can be made to further improve performance? The largest

performance penalty from multiple block fetching is conditional branch misprediction.

Significant improvement over the accuracy of two-level adaptive branch prediction is

unlikely, since Chen et al. showed that it is already close to optimal [11]. On the other hand,

other types of branch predictors besides a global two-level adaptive branch predictor need

to be researched to determine their effectiveness in predicting multiple branches in a block.

The greatest area for improvement is the reduction of the branch misprediction

penalty. The complexity involved in fetching multiple blocks using multiple instruction

cache banks and a prefetch buffer may require an additional pipeline stage. This effectively

increases the misprediction penalty by one cycle. A trace cache can be used to avoid this

penalty. However, instead of using resources for a trace cache, additional buffers can be

used for wrong-path instruction fetching. With a large number of banks, unused banks are

available to fetch the first few blocks from the alternate path once a conditional branch is

encountered. Hence, when a branch is ready to execute, its alternate path is ready to be

decoded the next cycle, should it be mispredicted. This eliminates pipeline bubbles from

the instruction cache stage and any additional instruction alignment stages. In addition,


Pierce and Mudge showed that wrong-path instruction prefetching increases the instruction

cache hit rate and hides the latency of instruction cache misses [31].

The performance penalties from misselection, GHR misprediction, immediate mis-

fetches, and bank conflicts are mitigated by prefetching. Of these, misselection is

the largest factor reducing the performance of multiple block fetching. Therefore,

additional research into different selection mechanisms is desirable. Selection accuracy might

be improved by using multiple predictors and choosing the best at run-time, similar to

choosing among multiple branch predictors. In addition, a cost savings in the selection

table can be made by eliminating GHR prediction. This can be accomplished by using a

different index for the select table. Instead of using the current GHR, the GHR from the

previous cycle can be used. Alternatively, the selection index need not be based on the

GHR. A selection history register could record the history of selection bits and be used as

an index into the select table. The savings in GHR misprediction penalties might be more

beneficial from the loss in accuracy from either scheme.

Multithreading is a technique which improves the parallelism available to the execu-

tion unit. In this situation, it is unclear if a sophisticated instruction fetching mechanism

used in a single-threaded superscalar processor is required in a multithreading processor.

Since multiple instruction streams are available, predicting multiple blocks per cycle is

not necessary. On the other hand, should a multithreaded machine execute in a

single-threaded fashion, a multiple block predictor may be beneficial.

The scalable register file architecture presented in Chapter 6 may prove especially

valuable in implementing a multithreaded architecture. Further research is needed to verify


that this technique will not degrade performance in the presence of multiple threads.

Nevertheless, the large savings in area and access time provide a significant improvement in cycle time and

performance.

Furthermore, the utilization of the register file can be increased by mapping registers

to the data cache. Work done in [45] successfully mapped registers when using a reorder

buffer for renaming. This should also work well when using a hybrid renaming technique

and dynamic result renaming. Mapping registers to the data cache is viable because only

a relatively small subset of the registers is required for immediate use. Other registers not

used recently or only needed in case of an exception could be stored in the data cache.

Consequently, the number of physical registers could be reduced. This may be especially

useful in deep pipelines where a large number of speculative registers is required.

Microprocessors will continue to grow in size and issue width, executing more

instructions per cycle. Continued research into instruction fetching, execution, and design

issues is needed to further improve performance.


Bibliography

[1] Anant Agrawal. UltraSPARC: A 64-bit, high-performance SPARC processor. In

Proceedings of MicroProcessor Forum, October 1994.

[2] P. Ahuja, D. Clark, and A. Rogers. The performance impact of incomplete

bypassing in processor pipelines. In 28th Annual International Symposium on

Microarchitecture, November 1995.

[3] A. Aiken and Alex Nicolau. Optimal loop parallelization. In ACM SIGPLAN 1988

Conference on Programming Language Design and Implementation, pages 308–317,

Atlanta, Georgia, June 1988.

[4] T. Ball and J. Larus. Branch prediction for free. In 1993 SIGPLAN Conference on

Programming Language Design and Implementation, pages 300–313, June 1993.

[5] Brad Calder and Dirk Grunwald. Fast & accurate instruction fetch and branch pre-

diction. In 21st Annual International Symposium on Computer Architecture, pages

2–11, Chicago, Illinois, April 1994.

[6] Brad Calder and Dirk Grunwald. Reducing branch costs via branch alignment. In

Sixth International Conference on Architectural Support for Programming Languages

and Operating Systems, pages 242–251, October 1994.

[7] Brad Calder and Dirk Grunwald. Next cache line and set prediction. In 22nd Annual

International Symposium on Computer Architecture, pages 287–296, June 1995.


[8] Bradley Gene Calder. Hardware and Software Mechanisms for Instruction Fetch

Prediction. PhD thesis, University of Colorado, December 1995.

[9] Andrea Capitanio, Nikil Dutt, and Alexandru Nicolau. Design considerations for

limited connectivity VLIW architectures. TR 92-95, University of California, Irvine,

ICS Dept., 1992.

[10] Andrea Capitanio, Nikil Dutt, and Alexandru Nicolau. Partitioned register files

for VLIWs: A preliminary analysis of tradeoffs. In 25th Annual International

Symposium on Microarchitecture, pages 292–300, Portland, Oregon, December 1992.

[11] I-Cheng K. Chen, John T. Coffey, and Trevor N. Mudge. Analysis of branch pre-

diction via data compression. In Seventh International Conference on Architectural

Support for Programming Languages and Operating Systems, October 1996.

[12] Bob Cmelik and David Keppel. Shade: A fast instruction-set simulator for execution

profiling. In ACM SIGMETRICS, 1994.

[13] Thomas M. Conte, Kishore N. Menezes, Patrick M. Mills, and Burzin A. Patel.

Optimization of instruction fetch mechanisms for high issue rates. In 22nd Annual

International Symposium on Computer Architecture, pages 333–344, June 1995.

[14] Val Popescu et al. The Metaflow architecture. IEEE Micro, pages 10–13, 63–73, June

1991.

[15] Keith I. Farkas, Norman P. Jouppi, and Paul Chow. Register file design considerations

in dynamically scheduled processors. In Second International Symposium on High-

Performance Computer Architecture, pages 40–51, February 1996.

[16] J. A. Fisher and S. M. Freudenberger. Predicting conditional branch directions from

previous runs of a program. In Fifth International Conference on Architectural


Support for Programming Languages and Operating Systems, pages 85–95, October

1992.

[17] G. F. Grohoski. Machine organization of the IBM RS/6000 processor. IBM Journal

of R&D, 34(1):37–58, January 1990.

[18] Mark A. Horowitz. Timing models for MOS circuits. TR SEL83-003, Integrated

Circuits Laboratory, Stanford University, 1983.

[19] Mike Johnson. Superscalar Microprocessor Design. Prentice Hall, Englewood

Cliffs, 1991.

[20] David R. Kaeli and Philip G. Emma. Branch history table prediction of moving tar-

get branches due to subroutine returns. In 18th Annual International Symposium on

Computer Architecture, pages 34–42, May 1991.

[21] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, Englewood

Cliffs, NJ, 1992.

[22] D. J. Kuck, Y. Muraoka, and S. Chen. On the number of operations simultaneously

executable in Fortran-like programs and their resulting speedup. IEEE Transactions

on Computers, C-21:1293–1310, December 1972.

[23] Monica S. Lam and Robert P. Wilson. Limits of control flow on parallelism. In 19th

Annual International Symposium on Computer Architecture, pages 46–57, 1992.

[24] Johnny K. F. Lee and Alan J. Smith. Branch prediction strategies and branch target

buffer design. IEEE Computer, pages 6–22, January 1984.

[25] Scott McFarling. Combining branch predictors. TN 36, DEC-WRL, June 1993.

[26] Scott McFarling and John Hennessy. Reducing the cost of branches. In 13th Annual

International Symposium of Computer Architecture, 1986.


[27] Ravi Nair. Optimal 2-bit branch predictors. IEEE Transactions on Computers,

44(5):698–702, May 1995.

[28] A. Nicolau and J. A. Fisher. Measuring the parallelism available for very long instruc-

tion word architectures. IEEE Transactions on Computers, C-33:968–976, November

1984.

[29] Shien-Tai Pan, Kimming So, and Joseph T. Rahmeh. Improving the accuracy of dy-

namic branch prediction using branch correlation. In Fifth International Conference

on Architectural Support for Programming Languages and Operating Systems, pages

76–84, Boston, Massachusetts, October 12–15, 1992.

[30] David B. Papworth. Tuning the Pentium Pro microarchitecture. IEEE Micro, pages

8–15, April 1996.

[31] Jim Pierce and Trevor Mudge. Wrong-path instruction prefetching. In 29th Annual

International Symposium on Microarchitecture, December 1996.

[32] E. M. Riseman and C. C. Foster. The inhibition of potential parallelism by conditional

jumps. IEEE Transactions on Computers, C-21:1405–1411, December 1972.

[33] Eric Rotenberg, Steve Bennett, and James E. Smith. Trace cache: a low latency

approach to high bandwidth instruction fetching. In 29th Annual International

Symposium on Microarchitecture, December 1996.

[34] Andre Seznec. Don’t use the page number, but a pointer to it. In 23rd Annual

International Symposium on Computer Architecture, pages 104–113, May 1996.

[35] Andre Seznec, Stephan Jourdan, Pascal Sainrat, and Pierre Michaud. Multiple-

block ahead branch predictors. In Seventh International Conference on Architectural

Support for Programming Languages and Operating Systems, October 1996.

Page 172: Scalable Hardware Mechanisms for Superscalar Processorsengineering.uci.edu/~swallace/papers_wallace/pdf/...The dissertation of Steven Daniel Wallace is approved and is acceptable in

155

[36] J. E. Smith and A. R. Pleszkun. Implementing precise interrupts in pipelined proces-

sors. IEEE Transactions on Computers, C-37:562–573, May 1988.

[37] M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on multiple instruction is-

sue. In Third International Conference on Architectural Support for Programming

Languages and Operating Systems, pages 290–302, April 1989.

[38] S. Peter Song, Marvin Denman, and Joe Chang. The PowerPC 604 RISC microprocessor. IEEE Micro, pages 8–17, October 1994.

[39] Marc Tremblay and J. Michael O'Connor. UltraSPARC I: A four-issue processor supporting multimedia. IEEE Micro, pages 42–49, April 1996.

[40] Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo,

and Rebecca L. Stamm. Exploiting choice: Instruction fetch and issue on an im-

plementable simultaneous multithreading processor. In 23rd Annual International

Symposium on Computer Architecture, May 1996.

[41] David Wall. Limits of instruction-level parallelism. Technical Report 93/6, Digital

Equipment Corporation, November 1993.

[42] Steven Wallace. Performance analysis of a superscalar architecture. Master’s thesis,

University of California, Irvine, 1993.

[43] Steven Wallace and Nader Bagherzadeh. Performance issues of a superscalar micro-

processor. Microprocessors and Microsystems, 19(4):187–199, May 1995.

[44] Steven Wallace and Nader Bagherzadeh. Instruction fetching mechanisms for super-

scalar microprocessors. In Euro-Par ’96, August 1996.

[45] Steven Wallace and Nader Bagherzadeh. Resource efficient register file architectures.

Technical report, University of California, Irvine, ECE Department, December 1996.


[46] Steven Wallace and Nader Bagherzadeh. A scalable register file architecture for dynamically scheduled processors. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques, pages 179–184, October 1996.

[47] Steven Wallace and Nader Bagherzadeh. Multiple block and branch prediction. In Third International Symposium on High-Performance Computer Architecture, February 1997.

[48] Steven Wallace, Nirav Dagli, and Nader Bagherzadeh. Design and implementation of a 100 MHz centralized instruction window for a superscalar microprocessor. In 1995 International Conference on Computer Design, October 1995.

[49] David L. Weaver and Tom Germond. The SPARC Architecture Manual, Version 9. PTR Prentice Hall, Englewood Cliffs, NJ, 1994.

[50] Chih-Po Wen. Improving instruction supply efficiency in superscalar architectures using instruction trace buffers. In Proceedings of the 1992 ACM/SIGAPP Symposium on Applied Computing, pages 28–36, 1992.

[51] Steven J. E. Wilton and Norman P. Jouppi. An enhanced access and cycle time model for on-chip caches. Technical Report 93/5, Digital Equipment Corporation Western Research Lab, July 1994.

[52] Kenneth C. Yeager. MIPS R10000 superscalar microprocessor. IEEE Micro, pages 28–40, April 1996.

[53] Tse-Yu Yeh. Two-Level Adaptive Branch Prediction and Instruction Fetch Mechanisms for High Performance Superscalar Processors. PhD thesis, University of Michigan, 1993.


[54] Tse-Yu Yeh, Deborah T. Marr, and Yale N. Patt. Increasing the instruction fetch rate via multiple branch prediction and a branch address cache. In 7th ACM International Conference on Supercomputing, pages 67–76, Tokyo, Japan, July 1993.

[55] Tse-Yu Yeh and Yale N. Patt. Alternative implementations of two-level adaptive branch prediction. In 19th Annual International Symposium on Computer Architecture, pages 124–134, Gold Coast, Australia, May 1992.

[56] Tse-Yu Yeh and Yale N. Patt. A comparison of dynamic branch predictors that use two levels of branch history. In 20th Annual International Symposium on Computer Architecture, pages 257–266, San Diego, California, May 1993.