SPARC M5™ Supplement to Oracle SPARC Architecture 2011

Draft D0.7, 29 May 2014

Privilege Levels: Privileged and Nonprivileged

Distribution: Public

Oracle Corporation
4150 Network Circle
Santa Clara, CA 95054 U.S.A.
650-960-1300

Part No: 950-5560-00
Revision: Draft D0.7, 29 May 2014


Copyright © 2011, Oracle and/or its affiliates. All rights reserved.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. UNIX is a registered trademark licensed through X/Open Company, Ltd.

Contents

1 SPARC M5 Basics
  1.1 Background
  1.2 SPARC M5 Overview
  1.3 SPARC M5 Components
    1.3.1 SPARC Physical Core
    1.3.2 L3 Cache

2 Data Formats

3 Registers
  3.1 Floating-Point State Register (FSR)
  3.2 Ancillary State Registers (ASRs)
    3.2.1 Tick Register (TICK)
    3.2.2 Program Counter (PC)
    3.2.3 Floating-Point Registers State Register (FPRS)
    3.2.4 General Status Register (GSR)
    3.2.5 Software Interrupt Register (SOFTINT)
    3.2.6 System Tick Register (STICK)
    3.2.7 System Tick Compare Register (STICK_CMPR)
    3.2.8 Compatibility Feature Register (CFR)
    3.2.9 Pause (PAUSE)
  3.3 Privileged PR State Registers
    3.3.1 Trap State Register (TSTATE)
    3.3.2 Processor State Register (PSTATE)
    3.3.3 Trap Level Register (TL)
    3.3.4 Current Window Pointer (CWP) Register
    3.3.5 Global Level Register (GL)

4 Instruction Formats

5 Instruction Definitions
  5.1 Instruction Set Summary
  5.2 SPARC M5-Specific Instructions
  5.3 PREFETCH/PREFETCHA
  5.4 WRPAUSE
  5.5 Block Load and Store Instructions
  5.6 Integer Multiply-Add
  5.8 AES Operations (4 operand)
  5.9 AES Operations (3 operand)
  5.10 DES Operations (4 operand)
  5.11 DES Operations (2 operand)
  5.12 Camellia Operations (4 operand)
  5.13 Camellia Operations (3 Operand)
  5.14 Hash Operations
  5.15 CRC32C Operation (3 operand)
  5.16 MPMUL
  5.17 MONTMUL
  5.18 MONTSQR

6 Traps
  6.1 Trap Levels
  6.2 Trap Behavior

7 Interrupt Handling
  7.1 Interrupt Flow
    7.1.1 Sources
  7.2 CPU Interrupt Registers
    7.2.1 Interrupt Queue Registers

8 Memory Models
  8.1 Supported Memory Models
    8.1.1 TSO
    8.1.2 RMO

9 Address Spaces and ASIs
  9.1 Address Spaces
    9.1.1 52-bit Virtual and Real Address Spaces
  9.2 Alternate Address Spaces
    9.2.1 ASI_REAL, ASI_REAL_LITTLE, ASI_REAL_IO, and ASI_REAL_IO_LITTLE
    9.2.2 ASI_SCRATCHPAD
    9.2.3 ASI Accessible Shared Registers
    9.2.4 Block Initializing Store ASIs

10 Performance Instrumentation
  10.1 Introduction
  10.2 SPARC Performance Control Registers
  10.3 SPARC Performance Instrumentation Counter

11 Implementation Dependencies
  11.1 SPARC V9 General Information
    11.1.1 Level-2 Compliance (Impdep #1)
    11.1.2 Unimplemented Opcodes, ASIs, and ILLTRAP
    11.1.3 Trap Levels (Impdep #37, 38, 39, 40, 114, 115)
    11.1.4 Trap Handling (Impdep #16, 32, 33, 35, 36, 44)
    11.1.5 Secure Software
    11.1.6 Address Masking (Impdep #125)
  11.2 SPARC V9 Integer Operations
    11.2.1 Integer Register File and Window Control Registers (Impdep #2)
    11.2.2 Clean Window Handling (Impdep #102)
    11.2.3 Integer Multiply and Divide
    11.2.4 MULScc
  11.3 SPARC V9 Floating-Point Operations
    11.3.1 Overflow, Underflow, and Inexact Traps (Impdep #3, 55)
    11.3.2 Quad-Precision Floating-Point Operations (Impdep #3)
    11.3.3 Floating-Point Upper and Lower Dirty Bits in FPRS Register
    11.3.4 Floating-Point Status Register (FSR) (Impdep #13, 19, 22, 23, 24)
  11.4 SPARC V9 Memory-Related Operations
    11.4.1 Load/Store Alternate Address Space (Impdep #5, 29, 30)
    11.4.2 Read/Write ASR (Impdep #6, 7, 8, 9, 47, 48)
    11.4.3 MMU Implementation (Impdep #41)
    11.4.4 FLUSH and Self-Modifying Code (Impdep #122)
    11.4.5 PREFETCH{A} (Impdep #103, 117)
    11.4.6 LDD/STD Handling (Impdep #107, 108)
    11.4.7 FP mem_address_not_aligned (Impdep #109, 110, 111, 112)
    11.4.8 Supported Memory Models (Impdep #113, 121)
    11.4.9 Implicit ASI When TL > 0 (Impdep #124)
  11.5 Non-SPARC V9 Extensions
    11.5.1 Cache Subsystem
    11.5.2 Block Memory Operations
    11.5.3 Partial Stores
    11.5.4 Short Floating-Point Loads and Stores
    11.5.5 Load Twin Extended Word
    11.5.6 SPARC M5 Instruction Set Extensions (Impdep #106)
    11.5.7 Performance Instrumentation

12 Cryptographic Extensions
  12.1 CFR Register
  12.2 Cryptographic Instructions
  12.3 Cryptographic performance
  12.4 Core S3 Crypto Coding Guidance

13 Memory Management Unit
  13.1 Translation Table Entry (TTE)
  13.2 Translation Storage Buffer (TSB)
  13.3 MMU-Related Faults and Traps
    13.3.1 IAE_privilege_violation Trap
    13.3.2 IAE_nfo_page Trap
    13.3.3 DAE_privilege_violation Trap
    13.3.4 DAE_side_effect_page Trap
    13.3.5 DAE_nc_page Trap
    13.3.6 DAE_invalid_asi Trap
    13.3.7 DAE_nfo_page Trap
    13.3.8 privileged_action Trap
    13.3.9 *_mem_address_not_aligned Traps
  13.4 MMU Operation Summary
  13.5 Translation
    13.5.1 Instruction Translation
    13.5.2 Data Translation
  13.6 Compliance With the SPARC V9 Annex F
  13.7 MMU Internal Registers and ASI Operations
    13.7.1 Accessing MMU Registers
    13.7.2 Context Registers

A Programming Guidelines
  A.1 Multithreading
    A.1.1 Instruction fetch
    A.1.2 Select/Decode/Rename
    A.1.3 Pick/Issue/Execute
    A.1.4 Commit
    A.1.5 Context Switching Between Strands
    A.1.6 Synchronization
  A.2 Optimizing for Single-Threaded Performance or Throughput
  A.3 Instruction Latency
  A.4 Coding PAUSE loops

B IEEE 754 Floating-Point Support
  B.1 Special Operand and Result Handling

C Differences Between SPARC T4 and SPARC M5
  C.1 Architectural and Microarchitectural Differences
  C.2 Interrupt Handling Differences
  C.3 Address Spaces and ASIs Differences
    C.3.1 ASIs
    C.3.2 CSRs

D Cache Coherency and Ordering
  D.1 Cache and Memory Interactions
  D.2 Coherency Overview
  D.3 Cache Flushing
    D.3.1 Displacement Flushing
    D.3.2 Memory Accesses and Cacheability
    D.3.3 Coherence Domains
    D.3.4 Memory Synchronization: MEMBAR and FLUSH
    D.3.5 Atomic Operations
    D.3.6 Nonfaulting Load
  D.4 L1 I-Cache
    D.4.1 LFSR Replacement Algorithm
    D.4.2 Direct-Mapped Mode
    D.4.3 I-Cache Disable
  D.5 L1 D-Cache
    D.5.1 LRU Replacement Algorithm
    D.5.2 Direct-Mapped Mode
    D.5.3 D-Cache Disable
  D.6 L2 Cache

E Glossary

F Bibliography

Index


CHAPTER 1

SPARC M5 Basics

1.1 Background

SPARC M5 is the latest chip multi-threaded (CMT) processor in the M-series processor family. SPARC M5 uses the same processor core (Core S3), including private L2 cache, as the SPARC T4 processor, but implements a new SOC that includes a new crossbar, memory controllers, coherency controllers, I/O, and an L3 cache shared between the processor cores.

The SPARC M5 product line fully implements Sun’s Throughput Computing initiative for the horizontal system space. Throughput Computing is a technique that takes advantage of the thread-level parallelism that is present in most commercial workloads. Unlike desktop workloads, which often have a small number of threads running concurrently, most commercial workloads achieve their scalability by employing large pools of concurrent threads.

SPARC M5 supports up to an eight-way glueless (without external hub chips) coherent system using 7 coherence link channels, and up to a ninety-six-way glued (with external hub chips) coherent system using 6 scalability link channels. SPARC M5 has six SPARC physical processor cores. Each core has full hardware support for eight strands, two integer execution pipelines, one floating-point execution pipeline, and one memory pipeline. The SPARC cores are connected through a crossbar to an on-chip, unified, 48 MB, 4-banked, 12-way associative L3 cache. There are two on-chip memory controllers supporting 4 cascadable BoBs. The BoB chips interface directly to DDR3/DDR4 DIMMs. In addition, there are two on-chip x8 PCI-Express 3.0 I/O interfaces.

Historically, microprocessors have been designed to target desktop workloads, and as a result have focused on running a single thread as quickly as possible. Single-thread performance is achieved in these processors by a combination of extremely deep pipelines (over 20 stages in Pentium 4) and by executing multiple instructions in parallel (referred to as instruction-level parallelism or ILP). The basic tenet behind Throughput Computing is that exploiting ILP and deep pipelining has reached the point of diminishing returns, and as a result current microprocessors do not utilize their underlying hardware very efficiently. For many commercial workloads, the processor is idle most of the time waiting on memory, and even when it is executing it can often utilize only a small fraction of its wide execution width. So rather than building a large and complex ILP processor that sits idle most of the time, a number of small, single-issue processors that employ multithreading are built in the same chip area. Combining multiple processors on a single chip with multiple strands per processor provides very high performance for highly threaded commercial applications. This approach is called thread-level parallelism (TLP), and the difference between TLP and ILP is shown in FIGURE 1-1.

FIGURE 1-1 Differences Between TLP and ILP

The memory stall time of one strand can often be overlapped with execution of other strands on the same processor, and multiple processors run their strands in parallel. In the ideal case, shown in FIGURE 1-1, memory latency can be completely overlapped with execution of other strands. In contrast, instruction-level parallelism simply shortens the time to execute instructions and does not help much in overlapping execution with memory latency.¹
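The ideal-case overlap described above can be illustrated with a deliberately simple model. The sketch below assumes that every strand alternates between a fixed number of execution cycles and a fixed number of memory-stall cycles, and that strands never contend for anything other than the execution pipeline. The function name `core_utilization` and the cycle counts are hypothetical and are not taken from this document.

```python
def core_utilization(strands: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles in which the pipeline has a strand to execute.

    Idealized model (an assumption, not SPARC M5 behavior): each strand
    executes for compute_cycles, then stalls on memory for stall_cycles,
    and the stall of one strand overlaps perfectly with the execution
    of the others.
    """
    busy = strands * compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, busy)  # the pipeline cannot be more than fully busy


# One strand that computes for 10 cycles and stalls for 30 keeps the
# pipeline busy a quarter of the time; four such strands fill it.
single = core_utilization(1, 10, 30)
four = core_utilization(4, 10, 30)
```

Under these assumptions, adding strands raises utilization linearly until the stall time is fully hidden, which is the ideal case that FIGURE 1-1 depicts.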

Given this ability to overlap execution with memory latency, why don’t more processors utilize TLP? The answer is that designing processors is a mostly evolutionary process, and the ubiquitous deeply pipelined, wide ILP processors of today are the evolutionary outgrowth from a time when the processor was the bottleneck in delivering good performance. With processors capable of multiple-GHz clocking, the performance bottleneck has shifted to the memory and I/O subsystems, and TLP has an obvious advantage over ILP for tolerating the large I/O and memory latency prevalent in commercial applications.

Unlike first-generation TLP processors, SPARC M5 seeks to provide the best of TLP and ILP processors. In particular, SPARC M5 provides a robust out-of-order, dual-issue processor core that is heavily threaded among eight strands. It has a 16-stage integer pipeline to achieve high operating frequencies, advanced branch prediction to mitigate the effect of a deep pipeline, and dynamic allocation of processor resources to threads. This allows SPARC M5 to achieve very high single-thread performance (about 5x that of previous CMT processors such as the UltraSPARC T2), while still scaling to very high levels of throughput.

1.2 SPARC M5 Overview

SPARC M5 is a chip multi-threaded (CMT) processor which supports cache-coherent multi-socket systems. SPARC M5 contains six SPARC physical processor cores. The SPARC physical cores are connected through a crossbar to an on-chip unified 48 Mbyte, 12-way associative L3 cache (64-byte lines). The L3 cache is banked four ways to provide sufficient bandwidth for the SPARC physical cores.

1. Processors that employ out-of-order ILP can overlap some memory latency with execution. However, this overlap is typically limited to shorter memory latency events such as L1 cache misses that hit in the L2 cache. Longer memory latency events such as main memory accesses are rarely overlapped to a significant degree with execution by an out-of-order processor.

[FIGURE 1-1 artwork: rows for Strand 1 through Strand 4 with a legend of "Executing" and "Stalled on Memory" (TLP), and a row for ILP labeled "Single strand executing two instructions per cycle".]


1.3 SPARC M5 Components

This section describes each component in SPARC M5.

1.3.1 SPARC Physical Core

Each SPARC physical core has hardware support for eight strands. This support consists of a full integer register file with eight register windows per strand, a full floating-point register file per strand, and nearly all of the ASI, ASR, and privileged registers replicated per strand. The eight strands share the instruction and data caches.

Each SPARC physical core has a 16 KB, 4-way set-associative instruction cache with 32-byte lines, a 16 KB, 4-way set-associative data cache with 32-byte lines, and a 128 KB, 8-way set-associative L2 cache with 32-byte lines; all three caches are shared by the eight strands. The L1 data cache is write-through and does not allocate on a write miss; the L2 is store-in and allocates on a write miss. All strands share a floating-point unit incorporating fused multiply-add and VIS 3.0 instruction support.
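As a quick check on the cache parameters quoted above, the number of sets in a set-associative cache follows from its size, associativity, and line size. The sketch below is plain arithmetic; the helper name `num_sets` is hypothetical, and the inputs are the sizes stated in this section.

```python
KB = 1024


def num_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Return the number of sets: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size is not a multiple of ways * line size")
    return sets


# Per-core caches as described in this section.
l1_sets = num_sets(16 * KB, ways=4, line_bytes=32)    # L1 instruction or data cache
l2_sets = num_sets(128 * KB, ways=8, line_bytes=32)   # private L2 cache
```

With these inputs, each L1 cache works out to 128 sets and the L2 cache to 512 sets.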

The strands share a dual-issue, out-of-order pipeline, divided into two "slots". One instruction can be issued each cycle to each slot. Slot 0 contains an integer unit and a load/store unit, while slot 1 contains an integer unit, a branch unit, and a floating-point and graphics unit. Up to two instructions can complete each cycle for a peak operation rate of two instructions per cycle. The pipeline is both horizontally and vertically threaded; various segments of the pipeline handle strands differently. The instruction fetch unit fetches instructions from a given strand each cycle. Strands are selected for fetching based upon a least-recently-fetched algorithm. Once fetched, strands are then selected for decoding in a least-recently-decoded fashion and are then renamed and supplied into an out-of-order processor core. Once inside the out-of-order core, strands are picked for issue independently between slots, and in an oldest-ready-first fashion within a slot. Instructions complete out-of-order and are committed in-order within a strand, but independently between strands. Up to 128 instructions can be in flight within the processor core, in any combination across the active strands. In certain circumstances, hardware may activate heuristics to avoid starvation or performance imbalances resulting from unfair access to hardware resources. The L1 cache load-use latency is 5 cycles, the L2 cache load-use latency is 19 cycles, and the L3 load-use latency is 49 cycles.
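The least-recently-fetched selection mentioned above can be pictured with a small software model. This is only a sketch: the document does not describe how the hardware tracks fetch order or decides which strands are eligible in a given cycle, so the class name `LeastRecentlyFetchedPicker`, the `ready` set, and the use of a queue are all assumptions made for illustration.

```python
from collections import deque


class LeastRecentlyFetchedPicker:
    """Toy model: pick the eligible strand that was fetched longest ago."""

    def __init__(self, num_strands: int = 8):
        # Front of the queue is the least recently fetched strand.
        self._order = deque(range(num_strands))

    def pick(self, ready):
        """Return the chosen strand ID, or None if no strand is ready."""
        chosen = next((s for s in self._order if s in ready), None)
        if chosen is not None:
            # The chosen strand becomes the most recently fetched.
            self._order.remove(chosen)
            self._order.append(chosen)
        return chosen


picker = LeastRecentlyFetchedPicker(num_strands=4)
first = picker.pick({0, 1, 2, 3})   # strand 0 is oldest
second = picker.pick({0, 1, 2, 3})  # then strand 1
third = picker.pick({0, 3})         # strand 2 is older but not ready, so 3
```

The same structure would model least-recently-decoded selection by keeping a second, independent queue for the decode stage.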

1.3.1.1 Single-threaded and multi-threaded performance

SPARC M5 is dynamically threaded. While software can activate up to 8 strands on each core at a time, hardware dynamically and seamlessly allocates core resources such as instruction, data, and L2 caches, and out-of-order execution resources such as the 128-entry re-order buffer in the core, among the active strands.

Since the core dynamically allocates resources among the active strands, there is no explicit "single-thread mode" or "multi-thread mode" for software to activate or deactivate.

The extent to which strands compete for core resources depends upon their execution characteristics. These characteristics include cache footprints, inter-instruction dependencies in their execution streams, branch prediction effectiveness, and others. Consider one process which has a small cache footprint and a high correct branch prediction rate which, when running alone on a core, achieves 2 instructions per cycle (SPARC M5’s peak rate of instruction execution). We term this a high-IPC process. If another process with similar characteristics is activated on a different strand on the same core, each of the strands will likely operate at approximately 1 instruction per cycle. In other words, the single-thread performance of each process has been cut in half. As a rule of thumb, activating N high-IPC strands will result in each strand executing at 1/N of its peak rate, assuming each strand is capable of executing close to 2 instructions per cycle.

Now consider a process which is largely memory-bound. Its native IPC will be small, perhaps 0.2. If this process runs on one strand on a core with another clone process running on a different strand, there is a good chance that both strands will suffer no noticeable performance loss, and the core throughput will improve to 0.4 IPC. If a low-IPC process runs on one strand with a high-IPC process running on another strand, it is likely that the IPC of neither strand will be greatly perturbed. The high-IPC strand may suffer a slight performance degradation (as long as the low-IPC strand does not cause a substantial increase in cache miss rates for the high-IPC strand).
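The rules of thumb in the two preceding paragraphs can be captured in a toy model. The sketch below assumes that the only shared limit is the core's peak rate of 2 instructions per cycle and that, when total demand exceeds that peak, every strand is scaled back proportionally. Proportional scaling is an assumption chosen for simplicity, not a description of how the hardware arbitrates, and the name `shared_ipc` is hypothetical.

```python
PEAK_IPC = 2.0  # peak instruction rate of one core, from the text above


def shared_ipc(native_ipcs, peak=PEAK_IPC):
    """Estimate per-strand IPC when strands share one core.

    native_ipcs: the IPC each process achieves when running alone.
    If the combined demand fits under the peak, nothing slows down;
    otherwise every strand is scaled by the same factor (an assumption).
    """
    demand = sum(native_ipcs)
    if demand <= peak:
        return list(native_ipcs)
    scale = peak / demand
    return [ipc * scale for ipc in native_ipcs]


two_high = shared_ipc([2.0, 2.0])   # each strand drops to about half
two_low = shared_ipc([0.2, 0.2])    # neither strand slows; core runs at 0.4
mixed = shared_ipc([0.2, 2.0])      # both strands lose a small fraction
```

The model ignores cache interference and cooperative prefetching entirely, so it should be read only as a restatement of the 1/N rule, not as a performance predictor.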

The guidelines above are only general rules of thumb. The extent to which one strand affects another strand’s performance depends upon many factors. Processes which run fine on their own but suffer from destructive cache interference when run with other strands may suffer unacceptable performance losses. Similarly, it is also possible for strands to cooperatively improve performance when run together. This may occur when the strands running on one core share code or data. In this case, one strand may prefetch instructions or data that other strands will use in the near future.

The same discussion can apply between cores running in the chip. Since the L3 cache and memory controllers are shared between the cores, activity on one core can influence the performance of strands on another core.


1.3.2 L3 Cache

The L3 cache is banked four ways. It is inclusive of all chip-local L2 caches. To provide for better partial-die recovery, SPARC M5 can also be configured in 2-bank mode (with half the total cache size). Bank selection is based on address bits 7:6 for 4 banks, and bit 6 for 2 banks. The cache is 48 Mbytes and 12-way set associative. The line size is 64 bytes.
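The bank-selection rule above amounts to a shift and a mask. The following sketch shows it for both configurations; the function name `l3_bank` is hypothetical, and the code models only the bit selection stated in this section, not any other aspect of L3 indexing.

```python
def l3_bank(address: int, banks: int = 4) -> int:
    """Return the L3 bank selected by an address.

    4 banks: address bits 7:6 select the bank.
    2 banks: address bit 6 selects the bank.
    """
    if banks == 4:
        return (address >> 6) & 0b11
    if banks == 2:
        return (address >> 6) & 0b1
    raise ValueError("SPARC M5 L3 is described with 4 or 2 banks only")


# Consecutive 64-byte lines rotate through the banks.
banks_4 = [l3_bank(line * 64) for line in range(5)]
banks_2 = [l3_bank(line * 64, banks=2) for line in range(4)]
```

Because the low six address bits are the offset within a 64-byte line, using the next bits up spreads sequential lines evenly across the banks.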


CHAPTER 2

Data Formats

Data formats supported by SPARC M5 are described in the Oracle SPARC Architecture 2011 specification.


CHAPTER 3

Registers

3.1 Floating-Point State Register (FSR)

Each virtual processor has a Floating-Point State register. This register follows the Oracle SPARC Architecture 2011 specification, with the ver and qne fields permanently set to 0 (SPARC M5 does not support an FQ).

For more information on this register, see the Oracle SPARC Architecture 2011 specification.

3.2 Ancillary State Registers (ASRs)

This chapter discusses the SPARC M5 ancillary state registers. TABLE 3-1 summarizes and defines these registers.

TABLE 3-1 Summary of SPARC M5 Ancillary State Registers

ASRnumber ASR Name Access priv Description

0 Y RW N Y Register

1 Reserved — Any access causes a illegal_instructiontrap

2 CCR RW N Condition Code register

3 ASI RW N ASI register

4 TICK RO Y1 TICK register

5 PC RO2 N Program counter

6 FPRS RW N Floating-Point Registers Status register

07 - 14 Reserved - Any access causes an illegal_instructiontrap

15 (MEMBAR, STBAR) — N Instruction opcodes only, not an actualASR.

16 - 18 Reserved — Any access causes an illegal_instructiontrap

19 GSR RW N General Status register

20 SOFTINT_SET W Y4 Set bit in Soft Interrupt register

21 SOFTINT_CLR W Y4 Clear bit in Soft Interrupt register

17

Notes:

1. An attempted write by nonprivileged software to this register causes a privileged_opcode trap. An attempted write by privileged software to this register causes an illegal_instruction trap. See the Oracle SPARC Architecture 2011 specification for more detail.

2. A write to this register causes an illegal_instruction trap.

3. An attempted access in nonprivileged mode causes a privileged_opcode trap.

4. Read accesses cause an illegal_instruction trap. An attempted write access in nonprivileged mode causes a privileged_opcode trap.

5. A write by privileged or user software causes an illegal_instruction trap. See the Oracle SPARC Architecture 2011 specification for more detail.

6. Reads are nonprivileged. A write by privileged or user software causes an illegal_instruction trap.

TABLE 3-1 Summary of SPARC M5 Ancillary State Registers (Continued)

| ASR number | ASR Name | Access | priv | Description |
|---|---|---|---|---|
| 22 | SOFTINT | RW | Y (note 3) | Soft Interrupt register |
| 23 | Reserved | — | — | Any access causes an illegal_instruction trap |
| 24 | STICK | RW | Y (note 5) | System Tick register |
| 25 | STICK_CMPR | RW | Y (note 3) | System TICK Compare register |
| 26 | CFR | RO (note 6) | Y | Compatibility Feature Register |
| 27 | PAUSE | W | N | PAUSE is write-only; any read causes an illegal_instruction trap |
| 28 - 31 | Reserved | — | — | Any access causes an illegal_instruction trap |

3.2.1 Tick Register (TICK)

The TICK register contains one field: counter. The counter field is shared by the eight strands on a physical core. The counter increments each processor core clock cycle. The format of this register is shown in TABLE 3-2.

TABLE 3-2 TICK Register – TICK (ASR 04₁₆)

| Bit | Field | R/W | Description |
|---|---|---|---|
| 63 | — | RO | Reserved |
| 62:0 | counter | RW | Tick counter; increments each processor core clock cycle. |

3.2.2 Program Counter (PC)

Each strand has a read-only program counter register. The PC contains a 52-bit virtual address, and VA{63:52} is sign-extended from VA{51}. The format of this register is shown in TABLE 3-3.
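As an illustration of the PC format (a 52-bit virtual address, bits 63:52 sign-extended from VA{51}, and the low 2 bits reading as 0), the following sketch converts an arbitrary 64-bit value into the form the PC register would present. The helper name `canonical_pc` is hypothetical and is not defined by the specification.

```python
VA_BITS = 52                              # implemented virtual-address width
VA_MASK = (1 << VA_BITS) - 1              # bits 51:0
HIGH_MASK = ((1 << 64) - 1) ^ VA_MASK     # bits 63:52


def canonical_pc(value: int) -> int:
    """Model how a 64-bit value appears in the PC format (hypothetical helper)."""
    pc = value & VA_MASK & ~0b11          # keep VA{51:2}; bits 1:0 read as 0
    if pc & (1 << (VA_BITS - 1)):         # VA{51} set: extend the sign
        pc |= HIGH_MASK                   # bits 63:52 become all ones
    return pc
```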


TABLE 3-3 Program Counter – PC (ASR 05₁₆)

| Bit | Field | R/W | Description |
|---|---|---|---|
| 63:52 | va_high | RO | Sign-extended from VA{51}. |
| 51:2 | va | RO | Virtual address contained in the program counter. |
| 1:0 | — | RO | The lower 2 bits of the program counter always read as 0. |

3.2.3 Floating-Point Registers State Register (FPRS)

This register is described in Oracle SPARC Architecture 2011.

Implementation Note

SPARC M5 sets FPRS.du or FPRS.dl when an instruction that updates the floating-point register file successfully completes, or when an FMOVcc or FMOVr instruction that does not satisfy the destination register update condition successfully completes.

3.2.4 General Status Register (GSR)

Each virtual processor has a nonprivileged general status register (GSR). When PSTATE.pef or FPRS.fef is zero, accesses to this register cause an fp_disabled trap.

3.2.5 Software Interrupt Register (SOFTINT)

Each virtual processor has a privileged software interrupt register. Nonprivileged accesses to this register cause a privileged_opcode trap. The SOFTINT register contains two fields: sm and int_level. Note that while setting either the sm bit (bit 16) or the SOFTINT{14} bit generates interrupt_level_14, these bits are completely independent of each other. Thus a STICK compare sets only bit 16 and generates interrupt_level_14; it does not also set bit 14.

TABLE 3-4 specifies how interrupt_level_14 is shared between SOFTINT writes and STICK compares.

TABLE 3-4 Sharing of interrupt_level_14

| Event | SOFTINT{14} | sm | Action |
|---|---|---|---|
| STICK compare when sm = 0 | Unchanged | 1 | interrupt_level_14 if PSTATE.ie = 1 and PIL < 14 |
| Set sm = 1 when sm = 0 | Unchanged | 1 | interrupt_level_14 if PSTATE.ie = 1 and PIL < 14 |
| Set SOFTINT{14} = 1 when SOFTINT{14} = 0 | 1 | Unchanged | interrupt_level_14 if PSTATE.ie = 1 and PIL < 14 |

3.2.6 System Tick Register (STICK)

Each SPARC M5 physical processor core implements a STICK register, shared by all strands of that core.

Privileged software can read the STICK register with the RDSTICK instruction. Privileged software cannot write the STICK register; an attempt by privileged software to execute the WRSTICK instruction results in an illegal_instruction exception.

Nonprivileged software can read the STICK register with the RDSTICK instruction. Nonprivileged software cannot write the STICK register; an attempt by nonprivileged software to execute the WRSTICK instruction results in an illegal_instruction exception.

In SPARC M5, the difference between the values of two reads of the STICK register reflects the amount of time that has passed between the reads:

(value2 - value1) * 1 = the number of nanoseconds that passed between the reads.
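A minimal sketch of the elapsed-time calculation follows. It assumes the two arguments are raw 64-bit values returned by RDSTICK, masks off reserved bit 63, and treats the 63-bit counter as wrapping modulo 2^63. The wraparound handling and the name `elapsed_ns` are assumptions made for illustration; they are not stated in the specification.

```python
COUNTER_MASK = (1 << 63) - 1    # STICK bits 62:0; bit 63 is reserved
NS_PER_TICK = 1                 # STICK counts in increments of 1 ns


def elapsed_ns(value1: int, value2: int) -> int:
    """Nanoseconds between two STICK reads (hypothetical helper)."""
    c1 = value1 & COUNTER_MASK
    c2 = value2 & COUNTER_MASK
    # Assumption: a single wrap of the 63-bit counter is handled modulo 2**63.
    return ((c2 - c1) & COUNTER_MASK) * NS_PER_TICK
```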


TABLE 3-5 System Tick Register – STICK (ASR 18₁₆)

| Bit | Field | R/W | Description |
|---|---|---|---|
| 63 | — | RO | Reserved. |
| 62:0 | stick | RW | Elapsed time value, measured in increments of 1 ns. |

3.2.7 System Tick Compare Register (STICK_CMPR)

Each virtual processor has a privileged System Tick Compare (STICK_CMPR) register. Nonprivileged accesses to this register cause a privileged_opcode exception. STICK_CMPR contains two fields: int_dis and stick_cmpr. Only bits 62:7 of the stick_cmpr field are compared against the STICK counter field.

The int_dis bit controls whether a STICK interrupt_level_14 interrupt is posted in the SOFTINT register when STICK_CMPR bits 62:7 match STICK bits 62:7. The format of this register is shown in TABLE 3-6.

TABLE 3-6 System Tick Compare Register – STICK_CMPR (ASR 19₁₆)

| Bit | Field | R/W | Description |
|---|---|---|---|
| 63 | int_dis | RW | stick_int interrupt disable. If 1, stick_int interrupt generation is disabled. |
| 62:7 | stick_cmpr | RW | Compare value for stick_int interrupts. |
| 6:0 | — | RO | Reserved. |

After a power-on reset trap, STICK_CMPR.int_dis is set to 1 and STICK_CMPR.cmpr is undefined.

A stick_match exception occurs in the cycle in which all of the following three conditions are met:

1. STICK_CMPR.int_dis == 0.

2. A transition occurs from (STICK.counter)[62:7] < STICK_CMPR.cmpr[62:7] in one cycle to (STICK.counter)[62:7] >= STICK_CMPR.cmpr[62:7] in the following cycle.

3. This transition of state occurs due to incrementing STICK, and not due to writing STICK or STICK_CMPR.

When a stick_match interrupt occurs, SOFTINT{16} (sm) is set to 1. This has the effect of posting an interrupt_level_14 trap request to the virtual processor, which causes an interrupt_level_14 trap when (PIL < 14) and (PSTATE.ie == 1). The interrupt_level_14 trap handler must check SOFTINT{14} and SOFTINT{16} (sm) to determine the cause of the interrupt_level_14 trap.
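The match conditions and the handler's cause check can be sketched as below. This is a software model for illustration, not hardware behavior verbatim: the function names are hypothetical, and the caller is assumed to guarantee condition 3 (the change from `prev_stick` to `cur_stick` comes from the counter incrementing, not from a register write).

```python
CMP_MASK = ((1 << 63) - 1) & ~0x7F   # bits 62:7 take part in the compare
SOFTINT_SM = 1 << 16                 # SOFTINT{16}, set by a STICK compare
SOFTINT_14 = 1 << 14                 # SOFTINT{14}, set by software


def stick_match(prev_stick: int, cur_stick: int, stick_cmpr_reg: int) -> bool:
    """True if a stick_match would occur between two consecutive cycles."""
    if (stick_cmpr_reg >> 63) & 1:   # condition 1: int_dis must be 0
        return False
    cmpr = stick_cmpr_reg & CMP_MASK
    # Condition 2: counter[62:7] moves from below the compare value to >= it.
    return (prev_stick & CMP_MASK) < cmpr <= (cur_stick & CMP_MASK)


def level14_causes(softint: int) -> list:
    """Causes an interrupt_level_14 handler would find in SOFTINT."""
    causes = []
    if softint & SOFTINT_SM:
        causes.append("stick_match")
    if softint & SOFTINT_14:
        causes.append("softint_14")
    return causes
```

Because the two SOFTINT bits are independent, a handler modeled this way may see either cause alone or both at once.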


3.2.8 Compatibility Feature Register (CFR)

For general information on this register, see the Oracle SPARC Architecture 2011 specification.

Each virtual processor has a compatibility feature register (CFR). The CFR is read-only. The format of the CFR is shown in TABLE 3-7.

TABLE 3-7 Compatibility Feature Register – CFR (ASR 1A₁₆)

| Bit | Field | R/W | Description |
|---|---|---|---|
| 63:12 | — | RO | Reserved |
| 11 | crc32c | RO | If set, the processor supports the CRC32C opcode. If not set, an attempt to execute a CRC32C instruction results in a compatibility_feature trap. |
| 10 | montsqr | RO | If set, the processor supports the MONTSQR opcode. If not set, an attempt to execute a MONTSQR instruction results in a compatibility_feature trap. |
| 9 | montmul | RO | If set, the processor supports the MONTMUL opcode. If not set, an attempt to execute a MONTMUL instruction results in a compatibility_feature trap. |
| 8 | mpmul | RO | If set, the processor supports the MPMUL opcode. If not set, an attempt to execute an MPMUL instruction results in a compatibility_feature trap. |
| 7 | sha512 | RO | If set, the processor supports the SHA512 opcode. If not set, an attempt to execute a SHA512 instruction results in a compatibility_feature trap. |
| 6 | sha256 | RO | If set, the processor supports the SHA256 opcode. If not set, an attempt to execute a SHA256 instruction results in a compatibility_feature trap. |
| 5 | sha1 | RO | If set, the processor supports the SHA1 opcode. If not set, an attempt to execute a SHA1 instruction results in a compatibility_feature trap. |
| 4 | md5 | RO | If set, the processor supports the MD5 opcode. If not set, an attempt to execute an MD5 instruction results in a compatibility_feature trap. |
| 3 | camellia | RO | If set, the processor supports Camellia opcodes (CAMELLIA_F, CAMELLIA_FL, and CAMELLIA_FLI). If not set, an attempt to execute a Camellia instruction results in a compatibility_feature trap. |
| 2 | kasumi | RO | If set, the processor supports Kasumi opcodes (KASUMI_FL_XOR, KASUMI_FI_XOR, and KASUMI_FI_FI). If not set, an attempt to execute a Kasumi instruction results in a compatibility_feature trap. |
| 1 | des | RO | If set, the processor supports DES opcodes (DES_ROUND, DES_IP, DES_IIP, and DES_KEXPAND). If not set, an attempt to execute a DES instruction results in a compatibility_feature trap. |
| 0 | aes | RO | If set, the processor supports AES opcodes (AES_EROUND01, AES_EROUND23, AES_DROUND01, AES_DROUND23, AES_EROUND01_LAST, AES_EROUND23_LAST, AES_DROUND01_LAST, AES_DROUND23_LAST, AES_KEXPAND0, AES_KEXPAND1, and AES_KEXPAND2). If not set, an attempt to execute an AES instruction results in a compatibility_feature trap. |


The CFR enumerates the capabilities that SPARC M5 supports. While the current definition of the CFR only relates to cryptographic capability, additional capabilities may be added in future processors. Software can use the CFR to determine whether a set of cryptographic opcodes associated with a cryptographic function can be executed on an instance of SPARC M5. Hardware also uses the CFR to determine whether a cryptographic capability associated with an opcode is present. When SPARC M5 executes a cryptographic opcode, it associates a bit in the CFR with each opcode; the bit must be set, otherwise a compatibility_feature trap occurs.

The CFR allows software to construct an architecture that enables opcode reuse. A complete discussion is outside the scope of this document; however, a brief overview follows.

Consider the situation where a processor is introduced that supports three cryptographic opcodes: opA, opB, and opC. Cryptographic requirements could be such that opA=AES, opB=DES, and opC=Kasumi. Traditionally, for any derivative or next-generation processor for which different ciphers were of interest, it would be necessary to expend additional opcodes to achieve the necessary support: e.g. opD=Camellia, opE=MD5. The opcodes opA, opB, and opC would still be consumed in these follow-on processors, even if there was no longer any interest in the AES, DES, and Kasumi algorithms.

In conjunction with appropriate software architecture and infrastructure, the CFR enables opcode reuse by future processor generations when cryptographic algorithms become obsolete. Potential aliasing problems are disambiguated using the CFR. Each bit in the CFR is permanently assigned to a different cryptographic operation. For instance, bits 0, 1, and 2 are assigned to AES, DES, and Kasumi family opcodes, as shown above. The mapping in the CFR is fixed for all future and derivative processors. When an application wishes to perform an AES operation, it registers that request using the appropriate software architectural means, and uses opA in its binary. Prior to executing, system software or the application checks to make sure that the target processor binds the AES function to opA. It does so by examining the CFR to see if bit 0 is set. If so, the program executes using native AES instructions (opA); if not, system software and/or the application must support a non-native AES instruction implementation using standard instructions. It is expected that cryptographic libraries will contain the necessary checking, so hardware cryptographic support will be transparent to applications that perform cryptographic operations using cryptographic library calls. If the application does not use cryptographic libraries, it should check the CFR to make sure that hardware supports the appropriate function; otherwise it should emulate the function using standard instructions. Alternatively, if performance is not critical, it may rely on trap-and-emulate support provided by higher-level system software.

When the first generation of processor (G1) executes an AES opcode, it checks that CFR bit 0 is set. If so, the hardware performs the requested AES operation. Accordingly, on G1, an application is free to perform AES operations using opA. Similar enforcement is applied to DES and Kasumi, respectively.

Now consider what happens if the application is moved to a future processor (G2) which has reused opA to provide support for Camellia; i.e. opA=Camellia. When system software checks the capabilities for the program, or the program checks, it will see that G2 does not support AES using opA (CFR bit 0 will be 0). This allows system software or the application to emulate AES support using standard instructions. Note that if the application somehow runs without this check having been performed and issues opA, the G2 processor will examine the CFR bit for Camellia, and if set, the application will execute and get erroneous results (Camellia instead of AES). A similar problem exists if the application is developed for G2 hardware, but somehow runs on a G1 processor. Thus it is vital that system software and/or the application appropriately register their intent and check hardware capability prior to executing cryptographic opcodes.

As a result, given appropriate software infrastructure, instruction set designers may reuse opcodes to perform a variety of different operations, and applications will continue to see the expected results on different generation platforms.

Programming Note

For optimal performance, prior to using instruction-level cryptographic functions, applications and libraries should first check the CFR to ensure that the desired algorithm is supported by the hardware.
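The capability check recommended above can be sketched as follows. The sketch assumes the CFR value has already been obtained (on hardware, with the RDCFR instruction) and is passed in as an integer. The bit positions follow TABLE 3-7; the names `CFR_BITS`, `cfr_supports`, and `choose_impl` are hypothetical and only illustrate the decision between native instructions and a software fallback.

```python
# Bit positions from TABLE 3-7; the mapping is fixed for all processors.
CFR_BITS = {
    "aes": 0, "des": 1, "kasumi": 2, "camellia": 3,
    "md5": 4, "sha1": 5, "sha256": 6, "sha512": 7,
    "mpmul": 8, "montmul": 9, "montsqr": 10, "crc32c": 11,
}


def cfr_supports(cfr: int, feature: str) -> bool:
    """True if the CFR bit for the named capability is set."""
    return bool((cfr >> CFR_BITS[feature]) & 1)


def choose_impl(cfr: int, feature: str) -> str:
    """Pick native opcodes when the bit is set, else a software fallback."""
    return "native" if cfr_supports(cfr, feature) else "software"
```

Checking once at library initialization, rather than before every operation, avoids repeated register reads and never risks a compatibility_feature trap on an unsupported opcode.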


3.2.9 Pause (PAUSE)

SPARC M5 physically implements a 12-bit PAUSE register in bits 11:0. The value written to the PAUSE register via the WRPAUSE instruction is an unsigned 15-bit value that is then right-shifted by 3 bits (divided by 8), since hardware decrements the PAUSE register once every 8 cycles. Thus the unsigned 12-bit value represents a cycle count from 0 to a maximum of 32760 cycles. Writing to the nonprivileged PAUSE register stalls a thread for the number of cycles specified by the XOR of the source operands, except as follows:

1. Writing 0 to the PAUSE register stalls the thread for the minimum number of cycles (greater than zero, since there is a minimum stall time due to internal pipeline delays).

2. Writing a value larger than 2^15 - 1 causes hardware to saturate the 12-bit PAUSE register; hardware sets PAUSE to FFF₁₆ prior to decrementing it.

While the PAUSE register is nonzero, no instructions are selected from the strand issuing a WRPAUSE. An unmasked disrupting exception terminates the PAUSE.

For more information on this instruction, see the Oracle SPARC Architecture 2011 specification or Section 5.4, WRPAUSE.
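The arithmetic above can be modeled with a short sketch. It computes the nominal stall length only; the minimum pipeline stall that applies when 0 is written is not modeled because the text does not give its length. The function name `pause_cycles` and its parameters are hypothetical.

```python
PAUSE_MAX_WRITE = (1 << 15) - 1   # largest value that does not saturate
PAUSE_SATURATED = 0xFFF           # 12-bit register value after saturation


def pause_cycles(src1: int, src2: int) -> int:
    """Nominal stall, in cycles, for a WRPAUSE with two source operands."""
    requested = (src1 ^ src2) & ((1 << 64) - 1)   # XOR of the source operands
    if requested > PAUSE_MAX_WRITE:
        counter = PAUSE_SATURATED                 # hardware saturates
    else:
        counter = requested >> 3                  # divide by 8
    # The register is decremented once every 8 cycles.
    return counter * 8
```

Note that the low 3 bits of the written value are discarded, so a request for 100 cycles yields a nominal stall of 96.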

3.3 Privileged PR State Registers

TABLE 3-8 lists the privileged registers.

TABLE 3-8 Privileged Registers

| Register | Register Name | Access | Description |
|---|---|---|---|
| 0 | TPC | RW | Trap PC (note 1) |
| 1 | TNPC | RW | Trap Next PC (note 1) |
| 2 | TSTATE | RW | Trap State |
| 3 | TT | RW | Trap Type |
| 4 | TICK | RW | Tick |
| 5 | TBA | RW | Trap Base Address (note 1) |
| 6 | PSTATE | RW | Process State |
| 7 | TL | RW | Trap Level |
| 8 | PIL | RW | Processor Interrupt Level |
| 9 | CWP | RW | Current Window Pointer |
| 10 | CANSAVE | RW | Savable Windows |
| 11 | CANRESTORE | RW | Restorable Windows |
| 12 | CLEANWIN | RW | Clean Windows |
| 13 | OTHERWIN | RW | Other Windows |
| 14 | WSTATE | RW | Window State |
| 16 | GL | RW | Global Level |

1. SPARC M5 only implements bits 51:0 of the TPC, TNPC, and TBA registers. Bits 63:52 are always sign-extended from bit 51.


3.3.1 Trap State Register (TSTATE)

Each virtual processor has MAXPTL (2) Trap State registers. These registers hold the state values from the previous trap level. The format of one element of the TSTATE register array (corresponding to one trap level) is shown in TABLE 3-9.

TABLE 3-9 Trap State Register

| Bit | Field | R/W | Description |
|---|---|---|---|
| 63:42 | — | RO | Reserved. |
| 41:40 | gl | RW | Global level at previous trap level |
| 39:32 | ccr | RW | CCR at previous trap level |
| 31:24 | asi | RW | ASI at previous trap level |
| 20 | pstate.tct | RW | PSTATE.tct at previous trap level |
| 19:18 | — | RO | Reserved (corresponds to bits 11:10 of PSTATE) |
| 17 | pstate.cle | RW | PSTATE.cle at previous trap level |
| 16 | pstate.tle | RW | PSTATE.tle at previous trap level |
| 15:13 | — | RO | Reserved (corresponds to bits 7:5 of PSTATE) |
| 12 | pstate.pef | RW | PSTATE.pef at previous trap level |
| 11 | pstate.am | RW | PSTATE.am at previous trap level |
| 10 | pstate.priv | RW | PSTATE.priv at previous trap level |
| 9 | pstate.ie | RW | PSTATE.ie at previous trap level |
| 8 | — | RO | Reserved (corresponds to bit 0 of PSTATE) |
| 7:3 | — | RO | Reserved |
| 2:0 | cwp | RW | CWP from previous trap level |

3.3.2 Processor State Register (PSTATE)

Each virtual processor has a Processor State register. More details on PSTATE can be found in the Oracle SPARC Architecture 2011 specification. The format of this register is shown in TABLE 3-10; note that the memory model selection field (mm) mentioned in Oracle SPARC Architecture 2011 is not implemented in SPARC M5.

TABLE 3-10 Processor State Register

| Bit | Field | R/W | Description |
|---|---|---|---|
| 12 | tct | RW | Trap on control transfer |
| 9 | cle | RW | Current little endian |
| 8 | tle | RW | Trap little endian |
| 7:6 | — | RO | Reserved (mm; not implemented in SPARC M5) |
| 5 | — | RO | Reserved |
| 4 | pef | RW | Enable floating-point |
| 3 | am | RW | Address mask |
| 2 | priv | RW | Privileged mode |
| 1 | ie | RW | Interrupt enable |
| 0 | — | RO | Reserved (was ag) |
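The relationship between the two layouts (TSTATE bits 20:8 hold PSTATE bits 12:0 from the previous trap level) can be illustrated with a small decoder. This is a sketch for reading register dumps; the function name `decode_tstate` and the dictionary layout are hypothetical, and only fields that appear in TABLE 3-9 and TABLE 3-10 are extracted.

```python
def decode_tstate(tstate: int) -> dict:
    """Split a 64-bit TSTATE value into the fields listed in TABLE 3-9."""
    pstate = (tstate >> 8) & 0x1FFF        # TSTATE{20:8} mirrors PSTATE{12:0}
    return {
        "gl": (tstate >> 40) & 0x3,        # bits 41:40
        "ccr": (tstate >> 32) & 0xFF,      # bits 39:32
        "asi": (tstate >> 24) & 0xFF,      # bits 31:24
        "pstate": {
            "tct": (pstate >> 12) & 1,     # PSTATE bit 12
            "cle": (pstate >> 9) & 1,      # PSTATE bit 9
            "tle": (pstate >> 8) & 1,      # PSTATE bit 8
            "pef": (pstate >> 4) & 1,      # PSTATE bit 4
            "am": (pstate >> 3) & 1,       # PSTATE bit 3
            "priv": (pstate >> 2) & 1,     # PSTATE bit 2
            "ie": (pstate >> 1) & 1,       # PSTATE bit 1
        },
        "cwp": tstate & 0x7,               # bits 2:0
    }
```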



3.3.3 Trap Level Register (TL)

Each virtual processor has a Trap Level register. Writes to this register saturate at MAXPTL (2). This saturation is based on bits 2:0 of the write data; bits 63:3 of the write data are ignored.


3.3.4 Current Window Pointer (CWP) Register

Since N_REG_WINDOWS = 8 on SPARC M5, the CWP register in each virtual processor is implemented as a 3-bit register.


3.3.5 Global Level Register (GL)

Each virtual processor has a Global Level register, which controls which set of global register windows is in use. The maximum global level (MAXPGL) for SPARC M5 is 2, so GL is implemented as a 2-bit register on SPARC M5. On a trap, GL is set to min(GL + 1, MAXPTL).

Writes to the GL register saturate at MAXPTL. This saturation is based on bits 1:0 of the write data; bits 63:2 of the write data are ignored.

The format of the GL register is shown in TABLE 3-11.


Programming Note

Hyperprivileged changes to translation in delay slots of delayed control transfer instructions should be avoided.

TABLE 3-11 Global Level Register

| Bit | Field | R/W | Description |
|---|---|---|---|
| 1:0 | gl | RW | Global level. |
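The saturating-write rules for TL and GL, and the GL update on a trap, can be modeled as below. The limit for GL follows the text as written, which saturates at MAXPTL (equal to MAXPGL on SPARC M5, both 2). The function names are hypothetical.

```python
MAXPTL = 2   # maximum privileged trap level on SPARC M5
MAXPGL = 2   # maximum privileged global level on SPARC M5


def write_tl(data: int) -> int:
    """Value TL holds after a write: bits 2:0 are used, saturating at MAXPTL."""
    return min(data & 0x7, MAXPTL)


def write_gl(data: int) -> int:
    """Value GL holds after a write: bits 1:0 are used, saturating at MAXPTL."""
    return min(data & 0x3, MAXPTL)


def gl_on_trap(gl: int) -> int:
    """Value GL holds after a trap is taken."""
    return min(gl + 1, MAXPTL)
```

Because the upper write-data bits are ignored before saturation, writing 8 to TL yields 0 rather than 2.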


CHAPTER 4

Instruction Formats

Instruction formats are described in the Oracle SPARC Architecture 2011 specification.


CHAPTER 5

Instruction Definitions

5.1 Instruction Set Summary

The SPARC M5 CPU implements the Oracle SPARC Architecture 2011 instruction set.

TABLE 5-1 lists the complete SPARC M5 instruction set supported in hardware. All instructions that are part of Oracle SPARC Architecture 2011 are documented in the Oracle SPARC Architecture 2011 specification; any instructions that are extensions to OSA 2011 are documented in this chapter.

TABLE 5-1 Complete SPARC M5 Hardware-Supported Instruction Set

Opcode Description

ADD (ADDcc) Add (and modify condition codes)

ADDC (ADDCcc) Add with carry (and modify condition codes)

ADDXC (ADDXCcc) Add extended with carry (and modify condition codes)

AES_DROUND01 AES decrypt round, columns 0 & 1

AES_DROUND23 AES decrypt round, columns 2 & 3

AES_DROUND01_LAST AES decrypt last round, columns 0 & 1

AES_DROUND23_LAST AES decrypt last round, columns 2 & 3

AES_EROUND01 AES encrypt round, columns 0 & 1

AES_EROUND23 AES encrypt round, columns 2 & 3

AES_EROUND01_LAST AES encrypt last round, columns 0 & 1

AES_EROUND23_LAST AES encrypt last round, columns 2 & 3

AES_KEXPAND0 AES key expansion without round constant

AES_KEXPAND1 AES key expansion with round constant

AES_KEXPAND2 AES key expansion without SBOX

ALIGNADDRESS Calculate address for misaligned data access

ALIGNADDRESSL Calculate address for misaligned data access (little-endian)

ALLCLEAN Mark all windows as clean

AND (ANDcc) And (and modify condition codes)

ANDN (ANDNcc) And not (and modify condition codes)

ARRAY{8,16,32} 3-D address to blocked byte address conversion

Bicc Branch on integer condition codes

BMASK Writes the GSR.mask field

BPcc Branch on integer condition codes with prediction

BPr Branch on contents of integer register with prediction

BSHUFFLE Permutes bytes as specified by the GSR.mask field

CALL (note 1) Call and link

CAMELLIA_F Camellia F operation

CAMELLIA_FL Camellia FL operation


CAMELLIA_FLI Camellia FLI operation

CASA Compare and swap word in alternate space

CASXA Compare and swap doubleword in alternate space

C{W,X}Bcond Fused 32 or 64 bit compare and conditional branch

CMASK{8,16,32} Create GSR.mask from SIMD operation result

CRC32C CRC32C polynomial instruction

DES_IP DES initial permutation

DES_IIP DES inverse initial permutation

DES_KEXPAND DES key expansion

DES_ROUND DES round

DONE Return from trap

EDGE{8,16,32}{L}{N} Edge boundary processing {little-endian} {non-condition-code altering}

FABS(s,d) Floating-point absolute value

FADD(s,d) Floating-point add

FALIGNDATA Perform data alignment for misaligned data

FANDNOT1{s} Negated src1 and src2 (single precision)

FANDNOT2{s} Src1 and negated src2 (single precision)

FAND{s} Logical and (single precision)

FBPfcc Branch on floating-point condition codes with prediction

FBfcc Branch on floating-point condition codes

FCHKSM16 16-bit partitioned checksum

FCMP(s,d) Floating-point compare

FCMPE(s,d) Floating-point compare (exception if unordered)

FCMPEQ{16,32} Four 16-bit / two 32-bit compare: set integer dest if src1 = src2

FCMPGT{16,32} Four 16-bit / two 32-bit compare: set integer dest if src1 > src2

FCMPLE{16,32} Four 16-bit / two 32-bit compare: set integer dest if src1 ≤ src2

FCMPNE{16,32} Four 16-bit / two 32-bit compare: set integer dest if src1 ≠ src2

FDIV(s,d) Floating-point divide

FEXPAND Four 8-bit to 16-bit expand

FHADD{s,d} Floating-point add and halve

FHSUB{s,d} Floating-point subtract and halve

FiTO(s,d) Convert integer to floating-point

FLUSH Flush instruction memory

FLUSHW Flush register windows

FLCMP{s,d} Lexicographic compare

FMADD{s,d} Floating-point multiply-add single/double (fused)

FMEAN16 16-bit partitioned average

FMOV(s,d) Floating-point move

FMOV(s,d)cc Move floating-point register if condition is satisfied

FMOV(s,d)R Move floating-point register if integer register contents satisfy condition

FMSUB{s,d} Floating-point multiply-subtract single/double (fused)

FMUL(s,d) Floating-point multiply

FMUL8SUX16 Signed upper 8- x 16-bit partitioned product of corresponding components

FMUL8ULX16 Unsigned lower 8- x 16-bit partitioned product of corresponding components

FMUL8X16 8- x 16-bit partitioned product of corresponding components

FMUL8X16AL Signed lower 8- x 16-bit lower α partitioned product of four components

FMUL8X16AU Signed upper 8- x 16-bit lower α partitioned product of four components




FMULD8SUX16 Signed upper 8- x 16-bit multiply → 32-bit partitioned product of components

FMULD8ULX16 Unsigned lower 8- x 16-bit multiply → 32-bit partitioned product of components

FNADD(s,d) Floating-point add and negate

FNAND{s} Logical nand (single precision)

FNEG(s,d) Floating-point negate

FNHADD{s,d} Floating-point add and halve, then negate

FNMADD{s,d} Floating-point multiply-add and negate

FNMSUB{s,d} Floating-point negative multiply-subtract single/double (fused)

FNMUL{s,d} Floating-point multiply and negate

FNsMULd Floating-point multiply and negate

FNOR{s} Logical nor (single precision)

FNOT1{s} Negate (1’s complement) src1 (single precision)

FNOT2{s} Negate (1’s complement) src2 (single precision)

FONE{s} One fill (single precision)

FORNOT1{s} Negated src1 or src2 (single precision)

FORNOT2{s} src1 or negated src2 (single precision)

FOR{s} Logical or (single precision)

FPACKFIX Two 32-bit to 16-bit fixed pack

FPACK{16,32} Four 16-bit/two 32-bit pixel pack

FPADD{16,32}{s} Four 16-bit/two 32-bit partitioned add (single precision)

FPADD64 Fixed-point partitioned add

FPADDS{16,32}{s} Fixed-point partitioned add

FPMADDX Unsigned integer multiply-add

FPMADDXHI Unsigned integer multiply-add, return high-order 64 bits of result

FPMERGE Two 32-bit to 64-bit fixed merge

FPSUB{16,32}{s} Four 16-bit/two 32-bit partitioned subtract (single precision)

FPSUB64 Fixed-point partitioned subtract, 64-bit

FPSUBS{16,32}{s} Fixed-point partitioned subtract

FSLL{16,32} 16- or 32-bit partitioned shift, left (old mnemonic FSHL)

FSLAS{16,32} 16- or 32-bit partitioned shift, left or right (old mnemonic FSHLAS)

FSRA{16,32} 16- or 32-bit partitioned shift, left or right (old mnemonic FSHRA)

FSRL{16,32} 16- or 32-bit partitioned shift, left or right (old mnemonic FSHRL)

FsMULd Floating-point multiply single to double

FSQRT(s,d) Floating-point square root

FSRC1{s} Copy src1 (single precision)

FSRC2{s} Copy src2 (single precision)

F(s,d)TO(s,d) Convert between floating-point formats

F(s,d)TOi Convert floating point to integer

F(s,d)TOx Convert floating point to 64-bit integer

FSUB(s,d) Floating-point subtract

FUCMP{GT,LE,NE,EQ}8 Compare 8-bit unsigned fixed-point values

FXNOR{s} Logical xnor (single precision)

FXOR{s} Logical xor (single precision)

FxTO(s,d) Convert 64-bit integer to floating-point

FZERO{s} Zero fill (single precision)

ILLTRAP Illegal instruction

INVALW Mark all windows as CANSAVE



JMPL Jump and link

KASUMI_FI_XOR Kasumi FI followed by XOR

KASUMI_FI_FI Kasumi FI followed by FI

KASUMI_FL_XOR Kasumi FL followed by XOR

LDBLOCKF 64-byte block load

LDDF Load double floating-point

LDDFA Load double floating-point from alternate space

LDF Load floating-point

LDFA Load floating-point from alternate space

LDFSR Load floating-point state register lower

LDSB Load signed byte

LDSBA Load signed byte from alternate space

LDSH Load signed halfword

LDSHA Load signed halfword from alternate space

LDSTUB Load-store unsigned byte

LDSTUBA Load-store unsigned byte in alternate space

LDSW Load signed word

LDSWA Load signed word from alternate space

LDTW Load twin words

LDTWA Load twin words from alternate space

LDUB Load unsigned byte

LDUBA Load unsigned byte from alternate space

LDUH Load unsigned halfword

LDUHA Load unsigned halfword from alternate space

LDUW Load unsigned word

LDUWA Load unsigned word from alternate space

LDX Load extended

LDXA Load extended from alternate space

LDXFSR Load extended floating-point state register

LDXEFSR Load extended floating-point state register

LZD Leading zero detect on 64-bit integer register

MD5 MD5 hash

MEMBAR Memory barrier

MONTMUL Montgomery multiplication

MONTSQR Montgomery squaring

MOVcc Move integer register if condition is satisfied

MOVr Move integer register on contents of integer register

MOVdTOx Move floating-point register to integer register

MOVsTO{d,s}w Move floating-point register to integer register

MOV{x,w}TO{d,s} Move integer register to floating-point register

MPMUL Multiple-precision multiplication

MULScc Multiply step (and modify condition codes)

MULX Multiply 64-bit integers

NOP No operation

NORMALW Mark other windows as restorable

OR (ORcc) Inclusive-or (and modify condition codes)

ORN (ORNcc) Inclusive-or not (and modify condition codes)




OTHERW Mark restorable windows as other

PDIST Distance between 8 8-bit components

PDISTN Pixel component distance

POPC Population count

PREFETCH Prefetch data

PREFETCHA Prefetch data from alternate space

PST Eight 8-bit/4 16-bit/2 32-bit partial stores

RDASI Read ASI register

RDASR Read ancillary state register

RDCCR Read condition codes register

RDCFR Read compatibility feature register

RDFPRS Read floating-point registers state register

RDPC Read program counter

RDPR Read privileged register

RDTICK Read TICK register

RDY Read Y register

RESTORE Restore caller’s window

RESTORED Window has been restored

RETRY Return from trap and retry

RETURN Return

SAVE Save caller’s window

SAVED Window has been saved

SDIV (SDIVcc) 32-bit signed integer divide (and modify condition codes)

SDIVX 64-bit signed integer divide

SETHI Set high 22 bits of low word of integer register

SHA1 SHA-1 hash

SHA256 SHA-256 hash

SHA512 SHA-512 hash

SIAM Set interval arithmetic mode

SLL Shift left logical

SLLX Shift left logical, extended

SMUL (SMULcc) Signed integer multiply (and modify condition codes)

SRA Shift right arithmetic

SRAX Shift right arithmetic, extended

SRL Shift right logical

SRLX Shift right logical, extended

STB Store byte

STBA Store byte into alternate space

STBAR Store barrier

STBLOCKF 64-byte block store

STDF Store double floating-point

STDFA Store double floating-point into alternate space

STF Store floating-point

STFA Store floating-point into alternate space

STFSR Store floating-point state register

STH Store halfword

STHA Store halfword into alternate space


STTW Store twin words

STTWA Store twin words into alternate space

STW Store word

STWA Store word into alternate space

STX Store extended

STXA Store extended into alternate space

STXFSR Store extended floating-point state register

SUB (SUBcc) Subtract (and modify condition codes)

SUBC (SUBCcc) Subtract with carry (and modify condition codes)

SWAP Swap integer register with memory

SWAPA Swap integer register with memory in alternate space

TADDcc (TADDccTV) Tagged add and modify condition codes (trap on overflow)

TSUBcc (TSUBccTV) Tagged subtract and modify condition codes (trap on overflow)

Tcc Trap on integer condition codes (with 8-bit sw_trap_number; if bit 7 is set, trap to hyperprivileged)

UDIV (UDIVcc) Unsigned integer divide (and modify condition codes)

UDIVX 64-bit unsigned integer divide

UMUL (UMULcc) Unsigned integer multiply (and modify condition codes)

UMULXHI Unsigned 64 x 64 multiply, returning upper 64 product bits

WRASI Write ASI register

WRASR Write ancillary state register

WRCCR Write condition codes register

WRFPRS Write floating-point registers state register

WRPR Write privileged register

WRY Write Y register

XMULX{HI} XOR multiply

XNOR (XNORcc) Exclusive-nor (and modify condition codes)

XOR (XORcc) Exclusive-or (and modify condition codes)

1. The PC format saved by the CALL instruction is the same as the format of the PC register specified in Section 3.2.2, Program Counter (PC).

TABLE 5-2 lists the SPARC V9 and sun4v instructions that are not directly implemented in hardware by SPARC M5, and the exception that occurs when an attempt is made to execute them.

TABLE 5-2 Oracle SPARC Architecture 2011 Instructions Not Directly Implemented by SPARC M5 Hardware

| Opcode | Description | Exception |
|---|---|---|
| FABSq | Floating-point absolute value quad | illegal_instruction |
| FADDq | Floating-point add quad | illegal_instruction |
| FCMPq | Floating-point compare quad | illegal_instruction |
| FCMPEq | Floating-point compare quad (exception if unordered) | illegal_instruction |
| FDIVq | Floating-point divide quad | illegal_instruction |
| FdMULq | Floating-point multiply double to quad | illegal_instruction |
| FiTOq | Convert integer to quad floating-point | illegal_instruction |
| FMOVq | Floating-point move quad | illegal_instruction |
| FMOVqcc | Move quad floating-point register if condition is satisfied | illegal_instruction |
| FMOVqr | Move quad floating-point register if integer register contents satisfy condition | illegal_instruction |
| FMULq | Floating-point multiply quad | illegal_instruction |
| FNEGq | Floating-point negate quad | illegal_instruction |
| FSQRTq | Floating-point square root quad | illegal_instruction |
| F(s,d,q)TO(q) | Convert between floating-point formats to quad | illegal_instruction |
| FqTOi | Convert quad floating point to integer | illegal_instruction |
| FqTOx | Convert quad floating point to 64-bit integer | illegal_instruction |
| FSUBq | Floating-point subtract quad | illegal_instruction |
| FxTOq | Convert 64-bit integer to quad floating-point | illegal_instruction |
| IMPDEP1 (not listed in TABLE 5-1) | Implementation-dependent instruction | illegal_instruction |
| IMPDEP2 (not listed in TABLE 5-1) | Implementation-dependent instruction | illegal_instruction |
| LDQF | Load quad floating-point | illegal_instruction |
| LDQFA | Load quad floating-point from alternate space | illegal_instruction |
| STQF | Store quad floating-point | illegal_instruction |
| STQFA | Store quad floating-point into alternate space | illegal_instruction |

5.2 SPARC M5-Specific Instructions

5.3 PREFETCH/PREFETCHA

See the PREFETCH and PREFETCHA instruction descriptions in the Oracle SPARC Architecture 2011 specification for the standard definitions of these instructions. This section describes how SPARC M5 handles PREFETCH instructions.

SPARC M5 interprets the function codes for prefetch variants as follows:

TABLE 5-3 SPARC M5 interpretation of prefetch variants

fcn Prefetch Variant Action

0 Weak prefetch for several reads Prefetch to L1 data cache and L2 cache

1 Weak prefetch for one read Prefetch to L2 cache

2 Weak prefetch for several writes Prefetch to L2 cache (exclusive)

3 Weak prefetch for one write Prefetch to L2 cache (exclusive)

4 Prefetch Page NOP - no action taken

5 - 15 Reserved illegal_instruction trap

16 NOP NOP - no action taken

17 Strong prefetch to nearest unifiedcache

Prefetch to L2 cache

18 - 19 NOP NOP - no action taken

20 Strong prefetch for several reads Prefetch to L1 data cache and L2 cache

21 Strong prefetch for one read Prefetch to L2 cache

TABLE 5-2 Oracle SPARC Architecture 2011 Instructions Not Directly Implemented by SPARC M5 Hardware (2 of 2)

Opcode Description Exception

• 35

..

TABLE 5-3 SPARC M5 interpretation of prefetch variants (Continued)

fcn      Prefetch Variant                          Action
22       Strong prefetch for several writes        Prefetch to L2 cache (exclusive)
23       Strong prefetch for one write             Prefetch to L2 cache (exclusive)
24 - 31  NOP                                       NOP - no action taken

Programming Note

SPARC M5 does not implement any prefetch functions that prefetch solely to the L3 cache.

Implementation Note

On SPARC M5, prefetches can be dropped at the L1 data cache, the L2 cache, or the L3 cache. Prefetches may be dropped regardless of whether they are strong or weak. Weak prefetches are dropped if they miss the DTLB, whereas strong prefetches are dropped if hardware tablewalk returns an error or is not enabled; otherwise, the following conditions apply to either type. Prefetches are dropped when:

1. The prefetch is to an I/O page, or a page marked as non-cacheable or with side-effects.

2. The miss buffer in the L1 data cache fills beyond a high-water mark (this only applies when more than one thread is unparked).

3. The prefetch is for a data cache miss which is already outstanding.

4. The prefetch is a read prefetch that hits in the L1 cache.

5. The prefetch is a read prefetch to L2 which hits in the L2 cache.

6. The prefetch is a write prefetch which exists in the L2 cache in the exclusive state.

7. The prefetch misses in the L2 cache, and the L2 miss buffer fills beyond a high-water mark.

8. The prefetch misses in the L3 cache, and the L3 miss buffer fills beyond a high-water mark.

5.4 WRPAUSE

WRPAUSE is a mnemonic for a WRASR to ASR 27, the PAUSE register.

Writing to the PAUSE register suspends a strand for a specified number of processor cycles. The PAUSE register is write-only; the PAUSE register cannot be read. SPARC M5 implements a 15-bit PAUSE register as described below:

TABLE 5-4 PAUSE Register

Bit    Field  Access  Description
63:15  —      WO      Reserved.
14:3   pause  WO      Pause value from 0..32767 cycles
2:0    —      WO      Ignored.

When WRPAUSE is executed, the following sequence of events occurs in SPARC M5:

1. Hardware places the strand in a paused state¹. While in this state, execution of any instructions subsequent to WRPAUSE for this strand is temporarily suspended. Instruction fetch continues for this strand until the strand’s instruction buffer is filled.

2. Hardware checks the value that will be written to the PAUSE register. Hardware updates the strand’s PAUSE register with the value of (min(2^15 - 1, (R[rs1] xor simm13)) >> 3) or the value (min(2^15 - 1, (R[rs1] xor R[rs2])) >> 3), depending upon the instruction format. If the value written to PAUSE is 0, hardware will pause the strand for 1 cycle. The value placed in the PAUSE register is divided by 8 since each strand’s PAUSE register is processed every 8th processor clock cycle. Thus the actual duration of a WRPAUSE ranges from 1 to a maximum of 32760 cycles.

3. Hardware decrements the PAUSE register every 8th processor clock cycle. The strand remains in the paused state until any of the following events occurs, any one of which immediately forces the PAUSE register to be reset to 0:

a. the PAUSE register decrements to zero

b. an unmasked disrupting trap request is received

c. a deferred trap request is received

d. certain hyperprivileged events occur

Also see Oracle SPARC Architecture 2011 for details on what events can terminate a WRPAUSE operation.

4. When the PAUSE register becomes 0, SPARC M5 resumes instruction fetch and execution at the NPC of the WRPAUSE².

A masked trap request does not affect the PAUSE register or suspension of the strand.

Any disrupting trap request that is posted after WRPAUSE has updated the PAUSE register and the strand has suspended forward progress does not result in a trap being taken on the WRPAUSE instruction; the trap is taken on a later instruction. This ensures forward progress when the trap handler retries the instruction on which the trap was taken.

1. SPARC M5 post-sync’s the WRPAUSE instruction; hardware prevents any subsequent instruction from this strand from entering the pipeline at the Select stage.

2. Hardware releases the post-sync at the Select stage, enabling subsequent instructions to enter the pipeline.

Programming Note

WRPAUSE is intended to be used as part of a progressive (exponential) backoff algorithm.

Programming Note

In SPARC M5, care must be taken with the frequency of branch instructions when coding loops using WRPAUSE. See Section A.4 for details.
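The PAUSE value computation described in this section can be sketched as follows. This is only an illustrative model, not a description of the hardware: the function names are hypothetical, the second operand is assumed to be already sign-extended (simm13) or read from R[rs2], and the duration is the nominal figure assuming no trap or hyperprivileged event ends the pause early.

```python
PAUSE_LIMIT = (1 << 15) - 1        # 2^15 - 1, the clamp applied before the shift
MASK64 = (1 << 64) - 1             # R[] registers are 64 bits wide


def pause_register_value(rs1_value: int, operand: int) -> int:
    """Value latched into the PAUSE register: min(2^15 - 1, rs1 xor operand) >> 3."""
    requested = (rs1_value ^ operand) & MASK64
    return min(PAUSE_LIMIT, requested) >> 3


def nominal_pause_cycles(rs1_value: int, operand: int) -> int:
    """Nominal pause length in processor cycles (assumption: never ended early)."""
    value = pause_register_value(rs1_value, operand)
    # A written value of 0 still pauses the strand for 1 cycle; otherwise the
    # register is decremented once every 8th processor clock cycle.
    return 1 if value == 0 else value * 8
```

For example, a request of 64 cycles latches the value 8, while any request of 32767 cycles or more is clamped and yields the stated maximum of 32760 cycles.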


5.5 Block Load and Store Instructions

See the LDBLOCKF and STBLOCKF instruction descriptions in the Oracle SPARC Architecture 2011 specification for the standard definitions of these instructions.

Block store commits in SPARC M5 do NOT force the data to be written to memory as specified in the Oracle SPARC Architecture 2011 specification. Block store commits are implemented the same as block stores in SPARC M5. As with all stores, block stores and block store commits maintain coherency with all I-caches, but will not flush any modified instructions executing down a pipeline. Flushing those instructions requires the pipeline to execute a FLUSH instruction.

Notes

If LDBLOCKF is used with an ASI_BLK_COMMIT_{P,S} and a destination register number rd is specified which is not a multiple of 8 (a misaligned rd), SPARC M5 generates an illegal_instruction exception (impl. dep. #255-U3-Cs10).

If LDBLOCKF is used with an ASI_BLK_COMMIT_{P,S} and a memory address is specified with less than 64-byte alignment, SPARC M5 generates a DAE_invalid_ASI exception (impl. dep. #256-U3).

SPARC M5 treats LDBLOCKF as interlocked with respect to following instructions. All later instructions see the effect of the newly loaded values.

STBLOCKF source data registers are interlocked against completion of previous instructions, including block load instructions; STBLOCKF instructions don’t commit until all previous instructions commit. Thus STBLOCKF instructions read the most recent value of the floating-point source register(s) when committing to memory. STBLOCKF instructions may or may not initialize the target memory locations to 0 prior to updating them with the source data. Thus another strand may observe these intermediate zero values prior to observing the final source data value.

LDBLOCKF does not follow memory model ordering with respect to stores. In particular, a read-after-write hazard to overlapping addresses is not detected. The side-effect bit associated with the access is ignored (see Translation Table Entry (TTE) on page 83). If ordering with respect to earlier stores is important (for example, a block load that overlaps previous stores), then there must be an intervening MEMBAR #StoreLoad (or stronger MEMBAR). If the LDBLOCKF overlaps a previous store and there is no intervening MEMBAR or data reference, the LDBLOCKF may return data from before or after the store.

These instructions are used for transferring large blocks of data (more than 256 bytes); for example, memcpy() and memset(). On SPARC M5, a block load forces a miss in the primary cache and will not allocate a line in the primary cache, but does allocate in L2.

SPARC M5 breaks block load and store instructions into 8 individual "helper" instructions. Each helper is translated as an independent instruction. Thus, it is possible that any individual helper or set of helpers translates to a different memory page from other helpers from the same instruction, if the underlying memory mapping is changed by another process during the execution of the block instruction. Any individual helper or set of helpers may also trap if memory mapping attributes are changed by another process in the midst of a series of helper translations. In the event multiple helpers have exceptions, SPARC M5 commits the helpers in program order from the lowest virtual address to the highest virtual address. Thus, the helper with the lowest virtual address which experiences an exception determines which trap will be taken. SPARC M5 makes no guarantee about the atomicity of address translation for block operations.

Block stores execute differently on SPARC M5 than on prior UltraSPARC processors. On previous processors, such as UltraSPARC T2, UltraSPARC T2+, and SPARC T3, block stores fetched the data from memory prior to updating the line with the store data. On SPARC M5, the processor first establishes the line in the L2 cache and zeroes the data, prior to updating the line with the store source data. The block store is helperized into 8 individual block init stores. The first helper establishes the line in the L2 cache, zeroes the line out, then updates the first 8 bytes of the line with the first 8 bytes of the store source data. The remaining seven helpers collectively update the remaining 56 bytes with the remaining 56 bytes of store source data. As a result, it is possible for another process to see the old data, the new data, or a value of zero while the block store is being executed.
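The intermediate states another observer might see during a block store can be illustrated with a small model. This is a sketch under stated assumptions, not a description of the actual pipeline: it assumes each of the seven remaining helpers writes one consecutive 8-byte chunk in ascending address order, and the function name is hypothetical.

```python
LINE_BYTES = 64                    # one block store covers a 64-byte line
HELPERS = 8                        # the block store is split into 8 helpers
CHUNK = LINE_BYTES // HELPERS      # 8 bytes written per helper (assumption)


def block_store_states(old_line: bytes, source: bytes) -> list:
    """Return the line contents observable after each helper completes."""
    if len(old_line) != LINE_BYTES or len(source) != LINE_BYTES:
        raise ValueError("both buffers must be exactly 64 bytes")
    line = bytearray(old_line)
    states = []
    for helper in range(HELPERS):
        if helper == 0:
            # The first helper establishes the line in L2 and zeroes it.
            line = bytearray(LINE_BYTES)
        start = helper * CHUNK
        line[start:start + CHUNK] = source[start:start + CHUNK]
        states.append(bytes(line))
    return states
```

After the first helper, the first 8 bytes hold new data and the other 56 bytes read as zero; only after the last helper does the whole line hold the source data. Software that shares the line with other strands therefore cannot assume it sees either the old or the new contents.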


STBLOCKF instructions do not conform to TSO store-store ordering with respect to older non-overlapping stores. A subsequent load to the same address as a STBLOCKF may not read the results of the STBLOCKF. The side-effects bit associated with the access is ignored. If ordering with respect to later loads is important then there must be an intervening MEMBAR instruction. If the STBLOCKF overlaps a later load and there is no intervening MEMBAR #StoreLoad instruction, the result of the load is undefined.

Compatibility Notes

Block load and store operations do not obey the ordering restrictions of the currently selected processor memory model (TSO, PSO, or RMO); block operations always execute under an RMO memory ordering model. In general, explicit MEMBAR instructions are required to order block memory operations among themselves or with respect to normal loads and stores. In addition, block operations do not generally conform to dependence order on the issuing virtual processor; that is, no read-after-write or write-after-read checking occurs between block loads and stores. Explicit MEMBARs are required to enforce dependence ordering between block operations that reference the same address. However, SPARC M5 partially orders some block operations.

TABLE 5-5 describes the synchronization primitives required in SPARC M5, if any, to guarantee TSO ordering between various sequences of memory reference operations. The first column contains the reference type of the first or earlier instruction; the second column contains the reference type of the second or the later instruction. SPARC M5 orders loads and block loads against all subsequent instructions.

TABLE 5-5 SPARC M5 Synchronization Requirements for Memory Reference Operations

First reference  Second reference  Synchronization Required
Load             Load              —
Load             Block load        MEMBAR #LoadLoad, #StoreLoad, #MemIssue, or #Sync
Load             Store             —
Load             Block store       —
Block load       Load              —
Block load       Block load        MEMBAR #LoadLoad, #StoreLoad, #MemIssue, or #Sync
Block load       Store             —
Block load       Block store       —
Store            Load              —
Store            Block load        MEMBAR #StoreLoad or #Sync
Store            Store             —
Store            Block store       MEMBAR #StoreStore or stronger, if to non-overlapping addresses
Block store      Load              MEMBAR #StoreLoad or #Sync
Block store      Block load        MEMBAR #StoreLoad or #Sync
Block store      Store             MEMBAR #StoreStore or stronger, if to non-overlapping addresses
Block store      Block store       MEMBAR #StoreStore or stronger, if to non-overlapping addresses

5.6 Integer Multiply-Add

SPARC M5 provides nonprivileged unsigned integer multiply-add instructions, FPMADDX and FPMADDXHI. More details regarding these instructions can be found in the Oracle SPARC Architecture 2011 specification.

5.7 Compare and Branch

SPARC M5 provides nonprivileged compare-and-branch instructions, as follows:

Opcode  Operation                                                            Test

32-bit Compare and Branch Operations
CWBNE   Compare and Branch if Not Equal                                      not Z
CWBE    Compare and Branch if Equal                                          Z
CWBG    Compare and Branch if Greater                                        not (Z or (N xor V))
CWBLE   Compare and Branch if Less or Equal                                  Z or (N xor V)
CWBGE   Compare and Branch if Greater or Equal                               not (N xor V)
CWBL    Compare and Branch if Less                                           N xor V
CWBGU   Compare and Branch if Greater Unsigned                               not (C or Z)
CWBLEU  Compare and Branch if Less or Equal Unsigned                         C or Z
CWBCC   Compare and Branch if Carry Clear (Greater Than or Equal, Unsigned)  not C
CWBCS   Compare and Branch if Carry Set (Less Than, Unsigned)                C
CWBPOS  Compare and Branch if Positive                                       not N
CWBNEG  Compare and Branch if Negative                                       N
CWBVC   Compare and Branch if Overflow Clear                                 not V
CWBVS   Compare and Branch if Overflow Set                                   V

64-bit Compare and Branch Operations
CXBNE   Compare and Branch if Not Equal                                      not Z
CXBE    Compare and Branch if Equal                                          Z
CXBG    Compare and Branch if Greater                                        not (Z or (N xor V))
CXBLE   Compare and Branch if Less or Equal                                  Z or (N xor V)
CXBGE   Compare and Branch if Greater or Equal                               not (N xor V)
CXBL    Compare and Branch if Less                                           N xor V
CXBGU   Compare and Branch if Greater Unsigned                               not (C or Z)
CXBLEU  Compare and Branch if Less or Equal Unsigned                         C or Z
CXBCC   Compare and Branch if Carry Clear (Greater Than or Equal, Unsigned)  not C
CXBCS   Compare and Branch if Carry Set (Less Than, Unsigned)                C
CXBPOS  Compare and Branch if Positive                                       not N
CXBNEG  Compare and Branch if Negative                                       N
CXBVC   Compare and Branch if Overflow Clear                                 not V
CXBVS   Compare and Branch if Overflow Set                                   V

† synonym: cbnz{x}   ‡ synonym: cbz{x}   ◊ synonym: cbgeu{x}   ∇ synonym: cblu{x}

More details regarding the compare-and-branch instructions can be found in the Oracle SPARC Architecture 2011 specification.
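The Test column above can be read as predicates over the N, Z, V, and C condition codes. The following sketch evaluates them in software. It assumes the codes are derived as for a SUBcc-style subtraction of the second operand from the first (with C set on an unsigned borrow), restricted to 32 bits for the CWB forms and 64 bits for the CXB forms; the function and table names are hypothetical.

```python
def condition_codes(a: int, b: int, bits: int):
    """Return (N, Z, V, C) for the subtraction a - b at the given width."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    result = (a - b) & mask
    n = bool(result & sign)                    # result is negative
    z = result == 0                            # operands are equal
    v = bool((a ^ b) & (a ^ result) & sign)    # signed overflow on subtract
    c = a < b                                  # unsigned borrow
    return n, z, v, c


# One predicate per opcode suffix, transcribed from the Test column.
TESTS = {
    "NE":  lambda n, z, v, c: not z,
    "E":   lambda n, z, v, c: z,
    "G":   lambda n, z, v, c: not (z or (n != v)),
    "LE":  lambda n, z, v, c: z or (n != v),
    "GE":  lambda n, z, v, c: not (n != v),
    "L":   lambda n, z, v, c: n != v,
    "GU":  lambda n, z, v, c: not (c or z),
    "LEU": lambda n, z, v, c: c or z,
    "CC":  lambda n, z, v, c: not c,
    "CS":  lambda n, z, v, c: c,
    "POS": lambda n, z, v, c: not n,
    "NEG": lambda n, z, v, c: n,
    "VC":  lambda n, z, v, c: not v,
    "VS":  lambda n, z, v, c: v,
}


def branch_taken(opcode: str, a: int, b: int) -> bool:
    """Decide whether e.g. 'CWBG' or 'CXBLEU' would branch for operands a, b."""
    opcode = opcode.upper()
    bits = {"CWB": 32, "CXB": 64}[opcode[:3]]
    return TESTS[opcode[3:]](*condition_codes(a, b, bits))
```

Note how the same operands can give different answers for the signed and unsigned tests: comparing -1 with 1 satisfies both CWBL (signed less) and CWBGU (unsigned greater).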


5.8 AES Operations (4 operand)

SPARC M5 provides nine nonprivileged, 4-operand instructions to support the AES cryptographic algorithm, as follows:

Instruction        Operation
AES_EROUND01       AES Encrypt columns 0&1
AES_EROUND23       AES Encrypt columns 2&3
AES_DROUND01       AES Decrypt columns 0&1
AES_DROUND23       AES Decrypt columns 2&3
AES_EROUND01_LAST  AES Encrypt columns 0&1 last round
AES_EROUND23_LAST  AES Encrypt columns 2&3 last round
AES_DROUND01_LAST  AES Decrypt columns 0&1 last round
AES_DROUND23_LAST  AES Decrypt columns 2&3 last round
AES_KEXPAND1       AES Key expansion with RCON

More details regarding these instructions can be found in the Oracle SPARC Architecture 2011 specification.

5.9 AES Operations (3 operand)

SPARC M5 provides two nonprivileged, 3-operand instructions to support the AES cryptographic algorithm, as follows:

Instruction        Operation
AES_KEXPAND0       AES Key expansion without RCON
AES_KEXPAND2       AES Key expansion without SBOX

5.10 DES Operations (4 operand)

SPARC M5 provides one nonprivileged, 4-operand instruction to support the DES cryptographic algorithm, as follows:

Instruction        Operation
DES_ROUND          two DES round operations

More details regarding this instruction can be found in the Oracle SPARC Architecture 2011 specification.

5.11 DES Operations (2 operand)

SPARC M5 provides three nonprivileged, 2-operand instructions to support the DES cryptographic algorithm, as follows:

Instruction        Operation
DES_IP             DES Initial Permutation
DES_IIP            DES Inverse Initial Permutation
DES_KEXPAND        DES Key Expand

5.12 Camellia Operations (4 operand)

SPARC M5 provides one nonprivileged, 4-operand instruction to support the Camellia cryptographic algorithm, as follows:

Instruction        Operation
CAMELLIA_F         Camellia F operation

5.13 Camellia Operations (3 operand)

SPARC M5 provides two nonprivileged, 3-operand instructions to support the Camellia cryptographic algorithm, as follows:

Instruction        Operation
CAMELLIA_FL        Camellia FL operation
CAMELLIA_FLI       Camellia FLI operation

5.14 Hash Operations

SPARC M5 provides four nonprivileged instructions to support cryptographic digest (hash) algorithms, as follows:

Instruction        Operation
MD5                MD5 operation on a single block
SHA1               SHA1 operation on a single block

More details regarding the hash instructions can be found in the Oracle SPARC Architecture 2011 specification.

5.15 CRC32C Operation (3 operand)

SPARC M5 provides one nonprivileged instruction to support the CRC32c checksum operation, as follows:

Instruction        Operation
CRC32C             two CRC32c operations

More details regarding the CRC32c instructions can be found in the Oracle SPARC Architecture 2011 specification.

5.16 MPMUL

SPARC M5 provides a nonprivileged instruction to support the multiple-precision multiplication operation, as follows:

Instruction        Operation
MPMUL              Multiple Precision Multiply

5.17 MONTMUL

SPARC M5 provides a nonprivileged instruction to support the Montgomery multiplication operation, as follows:

Instruction        Operation
MONTMUL            Montgomery Multiplication

5.18 MONTSQR

SPARC M5 provides a nonprivileged instruction to support the Montgomery squaring operation, as follows:

Instruction        Operation
MONTSQR            Montgomery Squaring

5.19 Kasumi Operations (4 operand)

Description

Kasumi is a block cipher that produces a 64-bit output from a 64-bit input under the control of a 128-bit key. The Kasumi cipher has eight rounds. Each round consists of an FL function, an FO function, and an XOR operation. Each FO is composed of three FI functions with xors above and below each FI. Odd rounds apply the functions FL, FO, XOR whereas even rounds order the functions FO, FL and XOR. A number of temporary variables are used in the functional descriptions below. For example, data_fl is the result of applying the FL operation to the rs1 data using the rs2 key. The Kasumi instructions operate on 64-bit floating-point registers.

CFR.kasumi must be set; otherwise a compatibility_feature trap results.

KASUMI_FL_XOR:
    data_fl{31:0} ← kasumi FL(data=FD[rs1]{31:0}, key=FD[rs2]{31:0});
    FD[rd]{31:0} ← data_fl{31:0} xor FD[rs3]{31:0};
    FD[rd]{63:32} ← 0000 0000₁₆;

    where FD[rs2]{63:0} = ( 32 unused :: KL(i1) :: KL(i2) )

KASUMI_FI_XOR:
    data_x1{15:0} ← FD[rs1]{31:16} xor FD[rs2]{63:48};
    data_fi{15:0} ← kasumi FI(data_x1{15:0}, FD[rs2]{47:32});
    data_x2{15:0} ← data_fi{15:0} xor FD[rs1]{15:0};
    data_x2{31:16} ← FD[rs1]{15:0};
    FD[rd]{31:0} ← data_x2{31:0} xor FD[rs3]{31:0};
    FD[rd]{63:32} ← 0000 0000₁₆;

    where FD[rs2]{63:0} = ( KO(i3) :: KI(i3) :: 32 unused )

The Kasumi instructions are new and not expected to be implemented on all Oracle SPARC Architecture implementations. Therefore, they should only be used in platform-specific dynamically-linked libraries or in software created by a runtime code generator that is aware of the specific virtual processor implementation on which it is executing.

Instruction    op5   Operation                  Assembly Language Syntax                             Class
KASUMI_FL_XOR  1010  Kasumi FL followed by xor  kasumi_fl_xor freg_rs1, freg_rs2, freg_rs3, freg_rd  N1
KASUMI_FI_XOR  1011  Kasumi FI followed by xor  kasumi_fi_xor freg_rs1, freg_rs2, freg_rs3, freg_rd  N1

Instruction format:

Field  10     rd     011001  rs1    rs3   op5  rs2
Bits   31:30  29:25  24:19   18:14  13:9  8:5  4:0
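The 4-operand instruction format above can be exercised with a short packing routine. This is a minimal sketch for illustration only: the function names are hypothetical, and the register arguments are taken to be the raw 5-bit field values as they appear in the instruction word (the mapping from floating-point register names to field values is not covered here).

```python
OP = 0b10                  # bits 31:30
OP3 = 0b011001             # bits 24:19
OP5 = {                    # bits 8:5, from the instruction table
    "kasumi_fl_xor": 0b1010,
    "kasumi_fi_xor": 0b1011,
}


def encode(mnemonic: str, rs1: int, rs2: int, rs3: int, rd: int) -> int:
    """Pack raw 5-bit register fields into a 32-bit instruction word."""
    for field in (rs1, rs2, rs3, rd):
        if not 0 <= field < 32:
            raise ValueError("register fields are 5 bits wide")
    return ((OP << 30) | (rd << 25) | (OP3 << 19) | (rs1 << 14)
            | (rs3 << 9) | (OP5[mnemonic] << 5) | rs2)


def decode(word: int) -> dict:
    """Split a 32-bit instruction word back into its fields."""
    return {
        "op":  (word >> 30) & 0x3,
        "rd":  (word >> 25) & 0x1F,
        "op3": (word >> 19) & 0x3F,
        "rs1": (word >> 14) & 0x1F,
        "rs3": (word >> 9) & 0x1F,
        "op5": (word >> 5) & 0xF,
        "rs2": word & 0x1F,
    }
```

A round trip through encode() and decode() is a convenient check that a disassembler or code generator agrees with the field boundaries listed in the format.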


Exceptions fp_disabled

Programming Note

The Kasumi instructions are components of the overall Kasumi algorithm. To perform an encryption or decryption, software must first expand the key. Key expansion is done only once per session key. The expanded keys are then applied to all blocks for that session. The following example has the expanded keys in FD[0] thru FD[46]. The initial 64-bit data is split into left and right halves and loaded into the lower half of FD[52] and FD[54] respectively. FD[56] must be initialized to 0. For each block, the following instruction sequence can be applied:

    kasumi_fl_xor %f52, %f0 , %f56, %f58    !# Round 1
    kasumi_fi_fi  %f58, %f2 , %f58
    kasumi_fi_xor %f58, %f4 , %f54, %f54
    kasumi_fi_fi  %f54, %f6 , %f58          !# Round 2
    kasumi_fi_xor %f58, %f8 , %f56, %f58
    kasumi_fl_xor %f58, %f10, %f52, %f52
    kasumi_fl_xor %f52, %f12, %f56, %f58    !# Round 3
    kasumi_fi_fi  %f58, %f14, %f58
    kasumi_fi_xor %f58, %f16, %f54, %f54
    kasumi_fi_fi  %f54, %f18, %f58          !# Round 4
    kasumi_fi_xor %f58, %f20, %f56, %f58
    kasumi_fl_xor %f58, %f22, %f52, %f52
    kasumi_fl_xor %f52, %f24, %f56, %f58    !# Round 5
    kasumi_fi_fi  %f58, %f26, %f58
    kasumi_fi_xor %f58, %f28, %f54, %f54
    kasumi_fi_fi  %f54, %f30, %f58          !# Round 6
    kasumi_fi_xor %f58, %f32, %f56, %f58
    kasumi_fl_xor %f58, %f34, %f52, %f52
    kasumi_fl_xor %f52, %f36, %f56, %f58    !# Round 7
    kasumi_fi_fi  %f58, %f38, %f58
    kasumi_fi_xor %f58, %f40, %f54, %f54
    kasumi_fi_fi  %f54, %f42, %f58          !# Round 8
    kasumi_fi_xor %f58, %f44, %f56, %f58
    kasumi_fl_xor %f58, %f46, %f52, %f52


5.20 Kasumi Operations (3 operand)

Description

The KASUMI_FI_FI instruction is one component of the Kasumi cipher. It operates on 64-bit floating-point registers.

CFR.kasumi must be set; otherwise a compatibility_feature trap results.

KASUMI_FI_FI:
    data_x1{31:16} ← FD[rs1]{31:16} xor FD[rs2]{63:48};
    data_x1{15:0} ← FD[rs1]{15:0} xor FD[rs2]{31:16};
    data_fi{31:16} ← kasumi FI ( data_x1{31:16}, FD[rs2]{47:32} );
    data_fi{15:0} ← kasumi FI ( data_x1{15:0}, FD[rs2]{15:0} );
    FD[rd]{31:16} ← data_fi{31:16} xor FD[rs1]{15:0};
    FD[rd]{15:0} ← data_fi{31:16} xor FD[rs1]{15:0} xor data_fi{15:0};
    FD[rd]{63:32} ← 0000 0000₁₆;

    where FD[rs2]{63:0} = ( KO(i1) :: KI(i1) :: KO(i2) :: KI(i2) )

Exceptions fp_disabled

Instruction   opf          Operation                 Assembly Language Syntax                  Class
KASUMI_FI_FI  1 0011 1000  two Kasumi FI operations  kasumi_fi_fi freg_rs1, freg_rs2, freg_rd  N1

Instruction format:

Field  10     rd     110110  rs1    opf   rs2
Bits   31:30  29:25  24:19   18:14  13:5  4:0

CHAPTER 6

Traps

6.1 Trap Levels

Only SPARC M5 specific behavior is described in this chapter; refer to Oracle SPARC Architecture 2011 for more detail on trap handling.

Each virtual processor supports two trap levels (MAXPTL = 2).

6.2 Trap Behavior

TABLE 6-1 specifies the codes used in the tables below.

Programming Note – TABLE 6-2 only contains those traps in which SPARC M5 differs from Oracle SPARC Architecture 2011; not all traps are listed. Refer to Oracle SPARC Architecture 2011 for more detail.

TABLE 6-1 Table Codes

Code  Meaning
H     Trap is taken in Hyperprivileged mode
P     Trap is taken via the Privileged trap table, in Privileged mode (PSTATE.priv = 1)
-x-   Not possible. Hardware cannot generate this trap in the indicated running mode. For example, all privileged instructions can be executed in privileged mode, therefore a privileged_opcode trap cannot occur in privileged mode.
—     This trap can only legitimately be generated by hyperprivileged software, not by the CPU hardware. So, for the purposes of sun4v, the trap vector has to be correct, but for a hardware CPU implementation these trap types are not generated by the hardware, therefore the resultant running mode is irrelevant.

CHAPTER 7

Interrupt Handling

This chapter describes the hardware interrupt delivery mechanism for the SPARC M5 chip.

Hyperprivileged code notifies privileged code about interrupt_vector, sw_recoverable_error, and hw_corrected_error traps (and precise error traps) through the cpu_mondo, dev_mondo, and resumable_error traps as described in Interrupt Queue Registers on page 52. Software interrupts are delivered to each virtual processor using the interrupt_level_n traps. Software interrupts are described in the Oracle SPARC Architecture 2011 specification.

Details of I/O interrupt processing are given in Chapter 22, PCI-Express. A high-level summary is given here. Any event generated by an I/O device which requires a CPU thread to be interrupted is passed through the DMU sitting at the root of the tree containing that I/O device. Note that all clusters hosted by a DMU (PCIe root ports, NIU Ethernet ports, and third-party IP) have a PCI-Express programming model. The implication is that an event requiring an interrupt to be sent arrives at the DMU in a PCIe-compatible way: as a PCIe Message, MSI (Message Signalled Interrupt), MSI-X (expanded MSI), or INTx (legacy wire-based interrupt used by older PCIe or PCI devices).

The DMU manages a number of Event Queues (EQs) in main memory. The DMU keeps track of tail pointers, which it updates when writing the EQs, and head pointers, which software updates when reading the EQs. When a message, MSI/X, or INTx arrives at the DMU, it maps from the message, MSI/X, or INTx ID to a specific Event Queue. The DMU contains a set of CSRs to do this mapping. The DMU writes an entry to the given EQ which contains information useful to software about the event. If the EQ transitions from empty to non-empty as a result of the new entry written, the DMU sends a mondo Interrupt Request (IREQ) transaction over the IOX to the local NCU. The IREQ contains the destination thread for the interrupt, which may be either a local thread or a thread on a remote SPARC M5 node.¹

The NCU maintains a state machine for each of the local threads, keeping track of whether a mondo is already pending to each respective thread. If the IREQ specifies a local thread and that thread already has a mondo pending, the IREQ is dropped because a thread can handle only one mondo outstanding and Hypervisor will eventually see the new EQ entry during the course of its processing the prior mondo. If the IREQ specifies a local thread and that thread does not have a mondo pending, the NCU generates a mondo to the given thread. Since no useful data is sent, mondo data is no longer available in SPARC M5 as it was in some earlier UltraSPARC processors before SPARC T3. If the IREQ specifies a remote thread, the NCU passes it back to the local IOX fabric, and it is routed to the destination SPARC M5 node where the NCU there receives it and processes it as described above.

A mondo is processed by Hypervisor, which reads the EQ or EQs mapped to the given mondo it has received, takes the appropriate actions, calling device drivers as necessary to clear interrupt status in the originating device, and updates the head pointer(s) for the given EQ or EQs. Note that multiple EQs managed by a given DMU may map to the same thread, and multiple EQs managed by different DMUs (even DMUs on different SPARC M5 nodes) may map to the same thread. The software flow for mondo handling is covered in more detail in the PCIe chapter.

1. SPARC T3 and SPARC T4 did not support remote interrupts, meaning that a mondo could be directed only to a local thread. Because SPARC M5 supports core off-lining for power reduction, it is necessary to support directing interrupts to remote threads where cores are powered. The mapping of EQ to threads can be changed dynamically through programming DMU CSRs.
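The interaction between the DMU event queues and the NCU per-thread state described above can be modeled in a few lines. This is a behavioral sketch only, under simplifying assumptions: class and method names are hypothetical, queue overflow is not handled, and the IOX fabric is reduced to a return value.

```python
class EventQueue:
    """Circular queue in main memory: the DMU advances tail, software advances head."""

    def __init__(self, size: int):
        self.size = size
        self.head = 0
        self.tail = 0
        self.entries = [None] * size

    def is_empty(self) -> bool:
        return self.head == self.tail

    def dmu_write(self, entry) -> bool:
        """Append an entry; return True if an IREQ must be sent (empty -> non-empty)."""
        was_empty = self.is_empty()
        self.entries[self.tail] = entry
        self.tail = (self.tail + 1) % self.size
        return was_empty

    def software_read_all(self) -> list:
        """Hypervisor drains the queue and updates the head pointer."""
        drained = []
        while not self.is_empty():
            drained.append(self.entries[self.head])
            self.head = (self.head + 1) % self.size
        return drained


class Ncu:
    """Tracks, per local thread, whether a mondo is already pending."""

    def __init__(self, node_id: int, thread_ids):
        self.node_id = node_id
        self.mondo_pending = {t: False for t in thread_ids}

    def handle_ireq(self, dest_node: int, dest_thread: int) -> str:
        if dest_node != self.node_id:
            return "forwarded"      # passed back to the IOX fabric for the remote node
        if self.mondo_pending[dest_thread]:
            return "dropped"        # the pending mondo's handler will see the new entry
        self.mondo_pending[dest_thread] = True
        return "mondo"

    def mondo_serviced(self, thread: int) -> None:
        self.mondo_pending[thread] = False
```

The model makes the coalescing visible: a second queue entry written while the queue is non-empty generates no IREQ at all, and an IREQ for a thread with a pending mondo is dropped without losing the event, because the entry is still in the queue when software drains it.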


7.1 Interrupt Flow

7.1.1 Sources

CPU cross-call interrupts can be generated by writing the Interrupt Vector Dispatch register described in Interrupt Vector Dispatch Register on page 145. Dispatching inter-CPU interrupts is described in Dispatching on page 139.

JTAG TAP interrupts can be generated by writing the NCU Interrupt Vector/Trap Dispatch Register described in CPU Interrupt Registers on page 52, via the JTAG port.

SSI interrupts (device ID = 2) are caused by an assertion (edge trigger) on the EXT_INT_L pin.

I/O interrupts arrive at the CPU thread as mondos, as described in the previous section.

7.2 CPU Interrupt Registers

7.2.1 Interrupt Queue Registers

Each virtual processor has eight ASI_QUEUE registers at ASI = 25₁₆, VA{63:0} = 3C0₁₆-3F8₁₆ that are used for communicating interrupts to the operating system. These registers contain the head and tail pointers for four supervisor interrupt queues: cpu_mondo, dev_mondo, resumable_error, nonresumable_error. The tail registers are read-only by supervisor, and read/write by hypervisor. Writes to the tail registers by the supervisor generate a DAE_invalid_ASI trap. The head registers are read/write by both supervisor and hypervisor.

Whenever the CPU_MONDO_HEAD register does not equal the CPU_MONDO_TAIL register, a cpu_mondo trap is generated. Whenever the DEV_MONDO_HEAD register does not equal the DEV_MONDO_TAIL register, a dev_mondo trap is generated. Whenever the RESUMABLE_ERROR_HEAD register does not equal the RESUMABLE_ERROR_TAIL register, a resumable_error trap is generated. Unlike the other queue register pairs, the nonresumable_error trap is not automatically generated whenever the NONRESUMABLE_ERROR_HEAD register does not equal the NONRESUMABLE_ERROR_TAIL register; instead, the hypervisor will need to generate the nonresumable_error trap. TABLE 7-1 through TABLE 7-8 define the format of the eight ASI_QUEUE registers.
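The head/tail comparison rules can be summarized in a short model. This is an illustrative sketch, not hardware behavior in full: the function name and the dictionary-based register file are hypothetical, while the ASI and virtual addresses are those given for the eight ASI_QUEUE registers.

```python
ASI_QUEUE = 0x25

# (head VA, tail VA) for each supervisor interrupt queue.
QUEUE_VA = {
    "cpu_mondo":          (0x3C0, 0x3C8),
    "dev_mondo":          (0x3D0, 0x3D8),
    "resumable_error":    (0x3E0, 0x3E8),
    "nonresumable_error": (0x3F0, 0x3F8),
}

# Queues for which a head/tail mismatch raises the trap automatically.
# nonresumable_error is excluded: the hypervisor must generate that trap itself.
AUTO_TRAP_QUEUES = ("cpu_mondo", "dev_mondo", "resumable_error")


def pending_traps(regs: dict) -> list:
    """Return the traps implied by the current register values (keyed by VA)."""
    return [name for name in AUTO_TRAP_QUEUES
            if regs[QUEUE_VA[name][0]] != regs[QUEUE_VA[name][1]]]
```

For example, advancing only the dev_mondo tail register yields a single dev_mondo trap, whereas a mismatch in the nonresumable error pair yields none.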

TABLE 7-1 CPU Mondo Head Pointer – ASI_QUEUE_CPU_MONDO_HEAD (ASI 25₁₆, VA 3C0₁₆)

Bit    Field  Initial Value  Access  Description
63:31  —      0              RO      Reserved
30:6   head   X              RW      Head pointer for CPU mondo interrupt queue.

TABLE 7-2 CPU Mondo Tail Pointer – ASI_QUEUE_CPU_MONDO_TAIL (ASI 25₁₆, VA 3C8₁₆)

Bit    Field  Initial Value  Access  Description
30:6   tail   X              RW      Tail pointer for CPU mondo interrupt queue.

TABLE 7-3 Device Mondo Head Pointer – ASI_QUEUE_DEV_MONDO_HEAD (ASI 25₁₆, VA 3D0₁₆)

Bit    Field  Initial Value  Access  Description
30:6   head   X              RW      Head pointer for device mondo interrupt queue.

TABLE 7-4 Device Mondo Tail Pointer – ASI_QUEUE_DEV_MONDO_TAIL (ASI 25₁₆, VA 3D8₁₆)

Bit    Field  Initial Value  Access  Description
30:6   tail   X              RW      Tail pointer for device mondo interrupt queue.

TABLE 7-5 Resumable Error Head Pointer – ASI_QUEUE_RESUMABLE_HEAD (ASI 25₁₆, VA 3E0₁₆)

Bit    Field  Initial Value  Access  Description
63:31  —      0              RO      Reserved.
30:6   head   X              RW      Head pointer for resumable error queue.

TABLE 7-6 Resumable Error Tail Pointer – ASI_QUEUE_RESUMABLE_TAIL (ASI 25₁₆, VA 3E8₁₆)

Bit    Field  Initial Value  Access  Description
30:6   tail   X              RW      Tail pointer for resumable error queue.

TABLE 7-7 Nonresumable Error Head Pointer – ASI_QUEUE_NONRESUMABLE_HEAD (ASI 25₁₆, VA 3F0₁₆)

Bit    Field  Initial Value  Access  Description
30:6   head   X              RW      Head pointer for nonresumable error queue.

TABLE 7-8 Nonresumable Error Tail Pointer – ASI_QUEUE_NONRESUMABLE_TAIL (ASI 25₁₆, VA 3F8₁₆)

Bit    Field  Initial Value  Access  Description
30:6   tail   X              RW      Tail pointer for nonresumable error queue.

CHAPTER 8

Memory Models

SPARC V9 defines the semantics of memory operations for three memory models. From strongest to weakest, they are Total Store Order (TSO), Partial Store Order (PSO), and Relaxed Memory Order (RMO). The differences in these models lie in the freedom an implementation is allowed in order to obtain higher performance during program execution. The purpose of the memory models is to specify any constraints placed on the ordering of memory operations in uniprocessor and shared-memory multiprocessor environments. SPARC M5 supports only TSO, with the exception that certain ASI accesses (such as block loads and stores) may operate under RMO.

Although a program written for a weaker memory model potentially benefits from higher execution rates, it may require explicit memory synchronization instructions to function correctly if data is shared. MEMBAR is a SPARC V9 memory synchronization primitive that enables a programmer to control explicitly the ordering in a sequence of memory operations. Processor consistency is guaranteed in all memory models.

The current memory model is indicated in the PSTATE.mm field. It is unaffected by normal traps. SPARC M5 ignores the value set in this field and always operates under TSO.

A memory location is identified by an 8-bit address space identifier (ASI) and a 64-bit virtual address. The 8-bit ASI may be obtained from an ASI register or included in a memory access instruction. The ASI is used to distinguish between and provide an attribute for different 64-bit address spaces. For example, the ASI is used by the SPARC M5 MMU to control access to implementation-dependent control and data registers and for access protection. Attempts by nonprivileged software (PSTATE.priv = 0) to access restricted ASIs (ASI{7} = 0) cause a privileged_action trap.

Real memory spaces can be accessed without side effects. For example, a read from real memory space returns the information most recently written. In addition, an access to real memory space does not result in program-visible side effects.

8.1 Supported Memory Models

The following sections contain brief descriptions of the two memory models supported by SPARC M5. These definitions are for general illustration. Detailed definitions of these models can be found in The SPARC Architecture Manual-Version 9. The definitions in the following sections apply to system behavior as seen by the programmer.

Notes

Stores to SPARC M5 internal ASIs, block loads, block stores, and block initializing stores are outside the memory model; that is, they need MEMBARs to control ordering.

Atomic load-stores are treated as both a load and a store and can only be applied to cacheable address spaces.


8.1.1 TSO

SPARC M5 implements the following programmer-visible properties in Total Store Order (TSO) mode:

■ Loads are processed in program order; that is, there is an implicit MEMBAR #LoadLoad between them.

■ Loads may bypass earlier stores. Any such load that bypasses such earlier stores must check (snoop) the store buffer for the most recent store to that address. A MEMBAR #Lookaside is not needed between a store and a subsequent load at the same noncacheable address.

■ A MEMBAR #StoreLoad must be used to prevent a load from bypassing a prior store if Strong Sequential Order is desired.

■ Stores are processed in program order.

■ Stores cannot bypass earlier loads.

■ Accesses to I/O space are all strongly ordered with respect to each other.

■ An L2 cache update is delayed on a store hit until all outstanding stores reach global visibility. For example, a cacheable store following a noncacheable store is not globally visible until the noncacheable store has reached global visibility; there is an implicit MEMBAR #MemIssue between them.
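The load-bypass property above can be illustrated with the classic store-buffering litmus test. The following Python sketch is a simplified abstract model and not a description of SPARC M5 hardware: the event names `st`, `vis`, and `ld` and the function `sb_outcomes` are hypothetical, and the MEMBAR #StoreLoad is modeled only as "the store must be globally visible before the following load executes".

```python
from itertools import permutations

def sb_outcomes(fence: bool) -> set:
    """Enumerate (r0, r1) results of the store-buffering litmus test.

    Thread 0: store x = 1; load y -> r0
    Thread 1: store y = 1; load x -> r1
    A store is issued into a private store buffer ("st") and becomes
    globally visible later ("vis"). With fence=True, a MEMBAR #StoreLoad
    between the store and the load is modeled by requiring "vis" to
    precede "ld" on the same thread.
    """
    events = [(t, e) for t in (0, 1) for e in ("st", "vis", "ld")]
    results = set()
    for order in permutations(events):
        pos = {ev: i for i, ev in enumerate(order)}
        legal = all(
            pos[(t, "st")] < pos[(t, "ld")]          # program order at issue
            and pos[(t, "st")] < pos[(t, "vis")]     # a store drains after issue
            and (not fence or pos[(t, "vis")] < pos[(t, "ld")])
            for t in (0, 1)
        )
        if not legal:
            continue
        mem = {"x": 0, "y": 0}
        regs = {}
        for t, e in order:
            own, other = ("x", "y") if t == 0 else ("y", "x")
            if e == "vis":
                mem[own] = 1
            elif e == "ld":
                # The load targets a different address than the buffered
                # store, so a store-buffer snoop would not hit.
                regs[t] = mem[other]
        results.add((regs[0], regs[1]))
    return results
```

In this model, `sb_outcomes(False)` contains the outcome `(0, 0)`, where both loads bypass the other thread's still-buffered store, while `sb_outcomes(True)` does not.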

8.1.2 RMO

SPARC M5 implements the following programmer-visible properties for special ASI accesses that operate under Relaxed Memory Order (RMO) mode:

■ There is no implicit order between any two memory references, either cacheable or noncacheable, except that noncacheable accesses to I/O space are all strongly ordered with respect to each other.

■ A MEMBAR must be used between cacheable memory references if stronger order is desired. A MEMBAR #MemIssue is needed for ordering of cacheable after noncacheable accesses.


CHAPTER 9

Address Spaces and ASIs

9.1 Address Spaces

SPARC M5 supports a 52-bit virtual address space.

9.1.1 52-bit Virtual and Real Address Spaces

SPARC M5 supports a 52-bit subset of the full 64-bit virtual and real address spaces. Although the full 64 bits are generated and stored in integer registers, legal addresses are restricted to two equal halves at the extreme lower and upper portions of the full virtual (real) address space. Virtual (real) addresses between 0008 0000 0000 0000₁₆ and FFF7 FFFF FFFF FFFF₁₆ inclusive lie within a “VA hole” (“RA hole”), are termed “out-of-range”¹, and are illegal. Prior UltraSPARC implementations introduced the additional restriction on software to not use pages within 4 Gbytes of the VA (RA) hole as instruction pages to avoid problems with prefetching into the VA (RA) hole. SPARC M5 implements a hardware check for instruction fetching near the VA (RA) hole and generates a trap when instructions are executed from a location in the address range 0007 FFFF FFFF FFE0₁₆ to 0007 FFFF FFFF FFFF₁₆, inclusive. However, even though SPARC M5 provides this hardware checking, it is still recommended that software not use the 8-Kbyte page before the VA (RA) hole for instructions. Address translation and MMU-related descriptions can be found in Translation on page 89.

FIGURE 9-1 SPARC M5’s 52-bit Virtual and Real Address Spaces, With Hole

Throughout this document, when virtual (real) address fields are specified as 64-bit quantities, they are assumed to be sign-extended based on VA{51} (RA{51}).

1. Another way to view an out-of-range address is as any address where bits {63:52} are not all equal to bit {51}.

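[Figure 9-1 content: the 64-bit address space runs from 0000 0000 0000 0000₁₆ to FFFF FFFF FFFF FFFF₁₆. The legal regions are 0000 0000 0000 0000₁₆ through 0007 FFFF FFFF FFFF₁₆ and FFF8 0000 0000 0000₁₆ through FFFF FFFF FFFF FFFF₁₆. The out-of-range VA (RA), that is, the “VA hole” (“RA hole”), spans 0008 0000 0000 0000₁₆ through FFF7 FFFF FFFF FFFF₁₆. Note (1): use of the region between 0007 FFFF FFFF DFFF₁₆ and the hole is restricted to data only.]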


A number of state registers are affected by the reduced virtual and real address spaces. The PC register is 52 bits, sign-extended to 64 bits on read accesses. The TBA, TPC, and TNPC registers are 52 bits and their values are not sign-extended when read. No checks are done when these registers are written by software. It is the responsibility of privileged software to properly update these registers.

An out-of-range virtual (real) address during an instruction access, caused by execution into the VA (RA) hole or into 0007 FFFF FFFF FFE0₁₆ to 0007 FFFF FFFF FFFF₁₆ inclusive, results in a trap if PSTATE.am = 0.

If the target virtual (real) address of a JMPL, RETURN, branch, or CALL instruction is an out-of-range address and PSTATE.am = 0, a trap is generated with TPC equal to the address of the JMPL, RETURN, branch, or CALL instruction.

An out-of-range virtual (real) address during a data access results in a trap if PSTATE.am = 0.
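The range rules in this section reduce to a sign-extension test on bit 51. The Python sketch below is illustrative only; the function and constant names are hypothetical, and the PSTATE.am = 1 case is modeled simply as "no trap", which is all the text above states.

```python
VA_BITS = 52
MASK64 = (1 << 64) - 1
LOW_MASK = (1 << VA_BITS) - 1

# Instruction-fetch guard range just below the hole (from the text above).
GUARD_LO = 0x0007_FFFF_FFFF_FFE0
GUARD_HI = 0x0007_FFFF_FFFF_FFFF

def sign_extend_52(addr: int) -> int:
    """Sign-extend a 52-bit address to 64 bits based on bit 51."""
    low = addr & LOW_MASK
    if low >> (VA_BITS - 1):
        low |= MASK64 & ~LOW_MASK
    return low

def is_out_of_range(addr: int) -> bool:
    """True if bits {63:52} are not all equal to bit {51}."""
    addr &= MASK64
    return sign_extend_52(addr) != addr

def instruction_fetch_traps(addr: int, pstate_am: int = 0) -> bool:
    """Model the instruction-access trap condition for PSTATE.am = 0."""
    if pstate_am:
        return False
    return is_out_of_range(addr) or GUARD_LO <= addr <= GUARD_HI
```

For example, `is_out_of_range(0x0008_0000_0000_0000)` is true (first address of the hole), while `is_out_of_range(0xFFF8_0000_0000_0000)` is false (first legal address of the upper half).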

9.2 Alternate Address Spaces

TABLE 9-1 summarizes the ASI usage in SPARC M5. The Section/Page column contains a reference to the detailed explanation of the ASI (the page number refers to this chapter). For internal ASIs, the legal VAs are listed (or the field contains “Any” if all VAs are legal). Only bits 51:0 are checked when determining the legal VA range. An access outside the legal VA range generates a DAE_invalid_asi trap.

Notes All internal, nontranslating ASIs in SPARC M5 can only be accessed using LDXA and STXA.

ASIs 80₁₆–FF₁₆ are unrestricted (access is allowed in all modes: nonprivileged, privileged). ASIs 00₁₆–2F₁₆ are restricted to privileged and hyperprivileged modes.
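The restriction rules stated so far (ASI{7} = 0 means restricted; 00₁₆–2F₁₆ are usable from privileged and hyperprivileged modes) can be sketched as a small classifier. This is a hedged illustration: the function names are hypothetical, the label for 30₁₆–7F₁₆ is deliberately generic because the note above does not classify that range further, and the sketch does not model DAE_invalid_asi for unimplemented ASI values.

```python
def asi_class(asi: int) -> str:
    """Classify an 8-bit ASI value by the restriction rules in the text."""
    if not 0 <= asi <= 0xFF:
        raise ValueError("ASI must be an 8-bit value")
    if asi & 0x80:
        return "unrestricted"   # 0x80-0xFF: allowed in all modes
    if asi <= 0x2F:
        return "privileged"     # 0x00-0x2F: privileged and hyperprivileged
    return "restricted"         # 0x30-0x7F: ASI{7} = 0, not detailed in the note

def nonprivileged_access_traps(asi: int) -> bool:
    """True if a nonprivileged access would cause privileged_action."""
    return asi_class(asi) != "unrestricted"
```

For instance, `nonprivileged_access_traps(0x25)` (ASI_QUEUE) is true, while `nonprivileged_access_traps(0xE2)` (ASI_TWINX_P) is false.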

TABLE 9-1 SPARC M5 ASI Usage

| ASI | ASI Name | R/W | VA | Copy per Strand | Description | Section/Page |
|---|---|---|---|---|---|---|
| 00₁₆–03₁₆ | | | Any | — | DAE_invalid_asi | |
| 04₁₆ | ASI_NUCLEUS | RW | Any | — | Implicit address space, nucleus context, TL > 0 | (See OSA2011) |
| 05₁₆–0B₁₆ | | | Any | — | DAE_invalid_asi | |
| 0C₁₆ | ASI_NUCLEUS_LITTLE | RW | Any | — | Implicit address space, nucleus context, TL > 0 (LE) | (See OSA2011) |
| 0D₁₆–0F₁₆ | | | Any | — | DAE_invalid_asi | |
| 10₁₆ | ASI_AS_IF_USER_PRIMARY | RW | Any | — | Primary address space, user privilege | (See OSA2011) |
| 11₁₆ | ASI_AS_IF_USER_SECONDARY | RW | Any | — | Secondary address space, user privilege | (See OSA2011) |
| 14₁₆ | ASI_REAL | RW | Any | — | Real address (normally used as cacheable) | Section 9.2.1 |
| 15₁₆ | ASI_REAL_IO | RW | Any | — | Real address (normally used as noncacheable, with side effect) | Section 9.2.1 |
| 16₁₆ | ASI_BLOCK_AS_IF_USER_PRIMARY | RW | Any | — | 64-byte block load/store, primary address space, user privilege | 5.5 |
| 17₁₆ | ASI_BLOCK_AS_IF_USER_SECONDARY | RW | Any | — | 64-byte block load/store, secondary address space, user privilege | 5.5 |
| 18₁₆ | ASI_AS_IF_USER_PRIMARY_LITTLE | RW | Any | — | Primary address space, user privilege (LE) | (See OSA2011) |
| 19₁₆ | ASI_AS_IF_USER_SECONDARY_LITTLE | RW | Any | — | Secondary address space, user privilege (LE) | (See OSA2011) |
| 1A₁₆–1B₁₆ | | | Any | — | DAE_invalid_asi | |
| 1C₁₆ | ASI_REAL_LITTLE | RW | Any | — | Real address (normally used as cacheable) (LE) | Section 9.2.1 |
| 1D₁₆ | ASI_REAL_IO_LITTLE | RW | Any | — | Real address (normally used as noncacheable, with side effect) (LE) | Section 9.2.1 |
| 1E₁₆ | ASI_BLOCK_AS_IF_USER_PRIMARY_LITTLE | RW | Any | — | 64-byte block load/store, primary address space, user privilege (LE) | 5.5 |
| 1F₁₆ | ASI_BLOCK_AS_IF_USER_SECONDARY_LITTLE | RW | Any | — | 64-byte block load/store, secondary address space, user privilege (LE) | 5.5 |
| 20₁₆ | ASI_SCRATCHPAD | RW | 0₁₆–18₁₆ | Y | Scratchpad registers | Section 9.2.2 |
| 20₁₆ | ASI_SCRATCHPAD | | 20₁₆–28₁₆ | — | DAE_invalid_asi | |
| 20₁₆ | ASI_SCRATCHPAD | RW | 30₁₆–38₁₆ | Y | Scratchpad registers | Section 9.2.2 |
| 21₁₆ | ASI_MMU | RW | 8₁₆ | Y | I/DMMU Primary Context register 0 | 13.7.2 |
| 21₁₆ | ASI_MMU | RW | 10₁₆ | Y | DMMU Secondary Context register 0 | 13.7.2 |
| 21₁₆ | ASI_MMU | RW | 108₁₆ | Y | I/DMMU Primary Context register 1 | 13.7.2 |
| 21₁₆ | ASI_MMU | RW | 110₁₆ | Y | DMMU Secondary Context register 1 | 13.7.2 |
| 22₁₆ | ASI_TWINX_AIUP, ASI_STBI_AIUP | RW | Any | — | Load: 128-bit atomic load twin extended word, primary address space, user privilege. Store: Block initializing store, primary address space, user privilege | 5.7.4 |
| 23₁₆ | ASI_TWINX_AIUS, ASI_STBI_AIUS | RW | Any | — | Load: 128-bit atomic load twin extended word, secondary address space, user privilege. Store: Block initializing store | (See OSA2011) |
| 24₁₆ | | | Any | — | DAE_invalid_asi | |
| 25₁₆ | ASI_QUEUE | RW | 3C0₁₆ | Y | CPU Mondo Queue head pointer | 7.2.1 |
| 25₁₆ | ASI_QUEUE | RW (hyperpriv), RO (priv) | 3C8₁₆ | Y | CPU Mondo Queue tail pointer | 7.2.1 |
| 25₁₆ | ASI_QUEUE | RW | 3D0₁₆ | Y | Device Mondo Queue head pointer | 7.2.1 |
| 25₁₆ | ASI_QUEUE | | 3D8₁₆ | Y | Device Mondo Queue tail pointer | 7.2.1 |
| 25₁₆ | ASI_QUEUE | RW | 3E0₁₆ | Y | Resumable Error Queue head pointer | 7.2.1 |
| 25₁₆ | ASI_QUEUE | | 3E8₁₆ | Y | Resumable Error Queue tail pointer | 7.2.1 |
| 25₁₆ | ASI_QUEUE | RW | 3F0₁₆ | Y | Nonresumable Error Queue head pointer | 7.2.1 |
| 25₁₆ | ASI_QUEUE | RW (hyperpriv), RO (priv) | 3F8₁₆ | Y | Nonresumable Error Queue tail pointer | 7.2.1 |
| 26₁₆ | ASI_TWINX_REAL, ASI_STBI_REAL | RW | Any | — | Load: 128-bit atomic LDDA, real address. Store: Block initializing store, real address | (See OSA2011) |
| 27₁₆ | ASI_TWINX_NUCLEUS, ASI_STBI_N | RW | Any | — | Load: 128-bit atomic load twin extended word from nucleus context. Store: Block initializing store from nucleus context | (See OSA2011) |
| 2A₁₆ | ASI_TWINX_AIUPL, ASI_STBI_AIUPL | RW | Any | — | Load: 128-bit atomic load twin extended word, primary address space, user privilege, little endian. Store: Block initializing store, primary address space, user privilege, little endian | (See OSA2011) |
| 2B₁₆ | ASI_TWINX_AIUSL, ASI_STBI_AIUSL | RW | Any | — | Load: 128-bit atomic load twin extended word, secondary address space, user privilege, little endian. Store: Block initializing store, secondary address space, user privilege, little endian | (See OSA2011) |
| 2C₁₆ | | | Any | — | DAE_invalid_asi | |
| 2D₁₆ | | | Any | — | DAE_invalid_asi | |
| 2E₁₆ | ASI_TWINX_REAL_LITTLE, ASI_STBI_REAL_LITTLE | RW | Any | — | Load: 128-bit atomic LDDA, real address (LE). Store: Block initializing store, real address (LE) | (See OSA2011) |
| 2F₁₆ | ASI_TWINX_NL, ASI_STBI_NL | RW | Any | — | Load: 128-bit atomic load twin extended word from nucleus context, little endian. Store: Block initializing store from nucleus context, little endian | (See OSA2011) |
| 4C₁₆ | ASI_CHDER | RW | 20₁₆ | N | Core Hang Detection Enable Register | Ch 17 |
| 4C₁₆ | ASI_DO_STATUS | RO | 38₁₆ | Y | Disable Overlap Status Register | Ch 21 |
| 4E₁₆ | ASI_SPARC_HW_CONFIG | RW | 8₁₆ | N | SPARC hardware configuration register | 20.1 |
| 80₁₆ | ASI_PRIMARY | RW | Any | — | Implicit primary address space | (See OSA2011) |
| 81₁₆ | ASI_SECONDARY | RW | Any | — | Implicit secondary address space | (See OSA2011) |
| 82₁₆ | ASI_PRIMARY_NO_FAULT | RO | Any | — | Primary address space, no fault | (See OSA2011) |
| 83₁₆ | ASI_SECONDARY_NO_FAULT | RO | Any | — | Secondary address space, no fault | (See OSA2011) |
| 88₁₆ | ASI_PRIMARY_LITTLE | RW | Any | — | Implicit primary address space (LE) | (See OSA2011) |
| 89₁₆ | ASI_SECONDARY_LITTLE | RW | Any | — | Implicit secondary address space (LE) | (See OSA2011) |
| 8A₁₆ | ASI_PRIMARY_NO_FAULT_LITTLE | RO | Any | — | Primary address space, no fault (LE) | (See OSA2011) |
| 8B₁₆ | ASI_SECONDARY_NO_FAULT_LITTLE | RO | Any | — | Secondary address space, no fault (LE) | (See OSA2011) |
| 8C₁₆–AF₁₆ | | | Any | — | DAE_invalid_asi | |
| B0₁₆ | ASI_PIC | RW | 0₁₆ | | Performance Instrumentation Counter 0 | 10.3 |
| B0₁₆ | ASI_PIC | RW | 8₁₆ | | Performance Instrumentation Counter 1 | 10.3 |
| B0₁₆ | ASI_PIC | RW | 10₁₆ | | Performance Instrumentation Counter 2 | 10.3 |
| B0₁₆ | ASI_PIC | RW | 18₁₆ | | Performance Instrumentation Counter 3 | 10.3 |
| B1₁₆–BF₁₆ | | | Any | — | DAE_invalid_asi | |
| C0₁₆ | ASI_PST8_P | WO | Any | — | Eight 8-bit conditional stores, primary address | (See OSA2011) |
| C1₁₆ | ASI_PST8_S | WO | Any | — | Eight 8-bit conditional stores, secondary address | (See OSA2011) |
| C2₁₆ | ASI_PST16_P | WO | Any | — | Four 16-bit conditional stores, primary address | (See OSA2011) |
| C3₁₆ | ASI_PST16_S | WO | Any | — | Four 16-bit conditional stores, secondary address | (See OSA2011) |
| C4₁₆ | ASI_PST32_P | WO | Any | — | Two 32-bit conditional stores, primary address | (See OSA2011) |
| C5₁₆ | ASI_PST32_S | WO | Any | — | Two 32-bit conditional stores, secondary address | (See OSA2011) |
| C6₁₆–C7₁₆ | | | Any | — | DAE_invalid_asi | |
| C8₁₆ | ASI_PST8_PL | WO | Any | — | Eight 8-bit conditional stores, primary address, little endian | (See OSA2011) |
| C9₁₆ | ASI_PST8_SL | WO | Any | — | Eight 8-bit conditional stores, secondary address, little endian | (See OSA2011) |
| CA₁₆ | ASI_PST16_PL | WO | Any | — | Four 16-bit conditional stores, primary address, little endian | (See OSA2011) |
| CB₁₆ | ASI_PST16_SL | WO | Any | — | Four 16-bit conditional stores, secondary address, little endian | (See OSA2011) |
| CC₁₆ | ASI_PST32_PL | WO | Any | — | Two 32-bit conditional stores, primary address, little endian | (See OSA2011) |
| CD₁₆ | ASI_PST32_SL | WO | Any | — | Two 32-bit conditional stores, secondary address, little endian | (See OSA2011) |
| CE₁₆–CF₁₆ | | | Any | — | DAE_invalid_asi | |
| D0₁₆ | ASI_FL8_P | RW | Any | — | 8-bit load/store, primary address | (See OSA2011) |
| D1₁₆ | ASI_FL8_S | RW | Any | — | 8-bit load/store, secondary address | (See OSA2011) |
| D2₁₆ | ASI_FL16_P | RW | Any | — | 16-bit load/store, primary address | (See OSA2011) |
| D3₁₆ | ASI_FL16_S | RW | Any | — | 16-bit load/store, secondary address | (See UA 2007) |
| D4₁₆–D7₁₆ | | | Any | — | DAE_invalid_asi | |
| D8₁₆ | ASI_FL8_PL | RW | Any | — | 8-bit load/store, primary address, little endian | (See OSA2011) |
| D9₁₆ | ASI_FL8_SL | RW | Any | — | 8-bit load/store, secondary address, little endian | (See OSA2011) |
| DA₁₆ | ASI_FL16_PL | RW | Any | — | 16-bit load/store, primary address, little endian | (See OSA2011) |
| DB₁₆ | ASI_FL16_SL | RW | Any | — | 16-bit load/store, secondary address, little endian | (See OSA2011) |
| DC₁₆–DF₁₆ | | | Any | — | DAE_invalid_asi | |
| E0₁₆ | ASI_BLK_COMMIT_PRIMARY | RW | Any | — | 64-byte block commit store, primary address | 5.5 |
| E1₁₆ | ASI_BLK_COMMIT_SECONDARY | RW | Any | — | 64-byte block commit store, secondary address | 5.5 |
| E2₁₆ | ASI_TWINX_P, ASI_STBI_P | RW | Any | — | Load: 128-bit atomic load twin extended word, primary address space. Store: Block initializing store, primary address space | (See OSA2011) |
| E3₁₆ | ASI_TWINX_S, ASI_STBI_S | RW | Any | — | Load: 128-bit atomic load twin extended word, secondary address space. Store: Block initializing store, secondary address space | (See OSA2011) |
| E4₁₆–E9₁₆ | | | Any | — | DAE_invalid_asi | |
| EA₁₆ | ASI_TWINX_PL, ASI_STBI_PL | RW | Any | — | Load: 128-bit atomic load twin extended word, primary address space, little endian. Store: Block initializing store, primary address space, little endian | (See OSA2011) |
| EB₁₆ | ASI_TWINX_SL, ASI_STBI_SL | RW | Any | — | Load: 128-bit atomic load twin extended word, secondary address space, little endian. Store: Block initializing store, secondary address space, little endian | (See OSA2011) |
| EC₁₆–EF₁₆ | | | Any | — | DAE_invalid_asi | |
| F0₁₆ | ASI_BLK_P | RW | Any | — | 64-byte block load/store, primary address | 5.5 |
| F1₁₆ | ASI_BLK_S | RW | Any | — | 64-byte block load/store, secondary address | 5.5 |
| F2₁₆ | ASI_STBIMRU_PRIMARY | RW | Any | | Block initializing store to primary, install as MRU in L2 cache | 5.8.1 |
| F3₁₆ | ASI_STBIMRU_SECONDARY | RW | Any | | Block initializing store to secondary, install as MRU in L2 cache | 5.8.1 |
| F4₁₆–F7₁₆ | | | Any | — | DAE_invalid_asi | |
| F8₁₆ | ASI_BLK_PL | RW | Any | — | 64-byte block load/store, primary address (LE) | 5.5 |
| F9₁₆ | ASI_BLK_SL | RW | Any | — | 64-byte block load/store, secondary address (LE) | 5.5 |
| FA₁₆ | ASI_STBIMRU_PRIMARY_LITTLE | WO | Any | | Block initializing store to primary little-endian, install as MRU in L2 cache | 5.8.1 |
| FB₁₆ | ASI_STBIMRU_SECONDARY_LITTLE | WO | Any | | Block initializing store to secondary little-endian, install as MRU in L2 cache | 5.8.1 |
| FC₁₆–FF₁₆ | | | Any | — | DAE_invalid_asi | |





9.2.1 ASI_REAL, ASI_REAL_LITTLE, ASI_REAL_IO, and ASI_REAL_IO_LITTLE

These ASIs are used to bypass the VA-to-RA translation. For these ASIs, the real address is set equal to the truncated virtual address (that is, RA{51:0} ← VA{51:0}), and the attributes used are those present in the matching TTE. The hypervisor will normally set the TTE attributes for ASI_REAL and ASI_REAL_LITTLE to cacheable (cp = 1) and for ASI_REAL_IO and ASI_REAL_IO_LITTLE to noncacheable, with side effect (cp = 0, e = 1). The hardware, however, does not require this; that is, it allows an ASI_REAL/ASI_REAL_LITTLE to be issued to a noncacheable address (PA{47} = 1) or an ASI_REAL_IO/ASI_REAL_IO_LITTLE to be issued to a cacheable address (PA{47} = 0); no error is flagged in this case.

9.2.2 ASI_SCRATCHPAD

Each virtual processor has a set of privileged ASI_SCRATCHPAD registers at ASI 20₁₆ with VA{63:0} = 0₁₆–18₁₆, 30₁₆–38₁₆. These registers are for scratchpad use by privileged software.

M5 Implementation Note Accesses to VA 20₁₆ and 28₁₆ are much slower than to the other six scratchpad registers.

9.2.3 ASI Accessible Shared Registers

There are a number of ASI-addressable registers which are shared by all cores. These registers are located outside the cores, and are mapped to the CMT region of IO space (PA[31:28] = 4’b1111). Please refer to Section 16.2.1 for details.

9.2.4 Block Initializing Store ASIs

Description Block initializing store ASIs can be selected for use in integer or floating-point store instructions. These ASIs allow block initializing stores to be performed to the same address spaces as normal stores. Little-endian ASIs access data in little-endian format; otherwise the access is assumed to be big-endian.

Integer and floating-point stores of all sizes (to alternate space) are allowed to use these ASIs.

| Instruction | imm_asi | ASI Value | Operation |
|---|---|---|---|
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_AS_IF_USER_PRIMARY (ASI_STBI_AIUP) | 22₁₆ | 64-byte block initializing store to primary address space, user privilege |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_AS_IF_USER_SECONDARY (ASI_STBI_AIUS) | 23₁₆ | 64-byte block initializing store to secondary address space, user privilege |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_REAL (ASI_STBI_R) | 26₁₆ | 64-byte block initializing store to real address |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_NUCLEUS (ASI_STBI_N) | 27₁₆ | 64-byte block initializing store to nucleus address space |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_AS_IF_USER_PRIMARY_LITTLE (ASI_STBI_AIUPL) | 2A₁₆ | 64-byte block initializing store to primary address space, user privilege, little-endian |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_AS_IF_USER_SECONDARY_LITTLE (ASI_STBI_AIUSL) | 2B₁₆ | 64-byte block initializing store to secondary address space, user privilege, little-endian |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_REAL_LITTLE (ASI_STBI_RL) | 2E₁₆ | 64-byte block initializing store to real address, little-endian |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_NUCLEUS_LITTLE (ASI_STBI_NL) | 2F₁₆ | 64-byte block initializing store to nucleus address space, little-endian |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_PRIMARY (ASI_STBI_P) | E2₁₆ | 64-byte block initializing store to primary address space |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_SECONDARY (ASI_STBI_S) | E3₁₆ | 64-byte block initializing store to secondary address space |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_PRIMARY_LITTLE (ASI_STBI_PL) | EA₁₆ | 64-byte block initializing store to primary address space, little-endian |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_SECONDARY_LITTLE (ASI_STBI_SL) | EB₁₆ | 64-byte block initializing store to secondary address space, little-endian |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_MRU_PRIMARY (ASI_STBIMRU_P) | F2₁₆ | 64-byte block initializing store to primary address space, install as MRU in L2 cache |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_MRU_SECONDARY (ASI_STBIMRU_S) | F3₁₆ | 64-byte block initializing store to secondary address space, install as MRU in L2 cache |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_MRU_PRIMARY_LITTLE (ASI_STBIMRU_PL) | FA₁₆ | 64-byte block initializing store to primary address space, little-endian, install as MRU in L2 cache |
| ST[B,H,W,TW,X]A | ASI_ST_BLKINIT_MRU_SECONDARY_LITTLE (ASI_STBIMRU_SL) | FB₁₆ | 64-byte block initializing store to secondary address space, little-endian, install as MRU in L2 cache |


All stores to these ASIs operate under relaxed memory ordering (RMO). To ensure ordering with respect to subsequent stores and loads, software must follow a sequence of these stores with a MEMBAR #StoreStore or #StoreLoad, respectively. To ensure ordering with respect to prior stores, software must precede these stores with a MEMBAR #StoreStore.

Stores to these ASIs where the least-significant 5 bits of the address are non-zero (that is, not the first word in the L2 cache line) behave the same as a normal RMO store. A store to these ASIs where the least-significant 5 bits are zero will load a line in the L2 cache with all zeros, and then update that line with the new store data. A store to these ASIs where the least-significant 6 bits are zero will load the first line (bit 4 equal to 0) in the L2 cache with all zeros, and then update that line with the new store data. The second 32B line may or may not be initialized to 0 prior to being established in the L2 cache. If the second 32B line is not initialized to 0, it is copied into the L2 cache using the current value from the L3 cache or memory. This special store will make sure the 32B lines maintain coherency when they are loaded into the L2 cache, but will not generally fetch the line from L3 cache or memory (initializing it with zeros instead), except as noted above. Stores using these ASIs to a noncacheable address behave the same as a normal store.
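The address-alignment cases in the preceding paragraph can be summarized as a small decision function. This Python sketch only restates the rules as written above; the function name and the returned labels are hypothetical, and it does not model cache state or coherency.

```python
def stbi_behavior(addr: int, cacheable: bool = True) -> str:
    """Classify a block-initializing store by the low address bits.

    Returns one of four illustrative labels:
      "normal"  noncacheable address, behaves as a normal store
      "rmo"     low 5 bits non-zero, behaves as a normal RMO store
      "init32"  low 5 bits zero, line zero-filled then updated
      "init64"  low 6 bits zero, first 32B line zero-filled; the second
                32B line may be zeroed or fetched from L3/memory
    """
    if not cacheable:
        return "normal"
    if addr & 0x1F:
        return "rmo"
    if addr & 0x3F == 0:
        return "init64"
    return "init32"
```

For example, `stbi_behavior(0x1000)` yields `"init64"`, `stbi_behavior(0x1020)` yields `"init32"`, and `stbi_behavior(0x1008)` yields `"rmo"`.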

The ASIs F2₁₆, F3₁₆, FA₁₆, and FB₁₆ operate as described above, but establish the line in the L2 cache as most-recently-used (MRU), thereby helping to ensure they are not replaced shortly after being established. This can aid in cases where the newly established line is expected to be referenced in the near future from a process running on the same physical core.

The following pseudocode shows how these ASIs can be used to do a quadword-aligned (on both source and destination) copy of N quadwords from A to B (where N > 3). Note that the final 64 bytes of the copy are performed using normal stores, guaranteeing that all initial zeros in a cache line are overwritten with copy data. This pseudocode may not be optimal for SPARC M5; it is provided as an example only.

%l0 ← [A]
%l1 ← [B]
prefetch [%l0]
for (i = 0; i < N-4; i++) {
    if ((i mod 4) ≠ 0) {
        prefetch [%l0+64]
    }
    ldtxa [%l0] #ASI_TWINX_P, %l2
    add %l0, 16, %l0
    stxa %l2, [%l1] #ASI_ST_BLKINIT_PRIMARY
    add %l1, 8, %l1
    stxa %l3, [%l1] #ASI_ST_BLKINIT_PRIMARY
    add %l1, 8, %l1
}
for (i = 0; i < 4; i++) {
    ldtxa [%l0] #ASI_TWINX_P, %l2
    add %l0, 16, %l0
    stx %l2, [%l1]
    stx %l3, [%l1+8]
    add %l1, 16, %l1
}
membar #Sync

Note These instructions are used for transferring large blocks of data (more than 256 bytes); for example, memcpy() and memset(). On SPARC M5, a twin load forces a miss in the primary cache and will not allocate a line in the primary cache, but does allocate in L2.

Programming Notes The Block Initializing Store ASIs are of Class "N" and are only allowed in dynamically linked, platform-specific, OS-enabled libraries.


CHAPTER 10

Performance Instrumentation

10.1 Introduction

As in previous UltraSPARC CMT processors such as UltraSPARC T1, UltraSPARC T2, UltraSPARC T2+, and SPARC T3, SPARC M5 supports monitoring processor performance by virtue of a set of performance counters. SPARC M5 expands on the capabilities of previous UltraSPARC CMT processors by adding more counters per virtual processor and by being able to measure additional processor and pipeline events. Significant differences from SPARC T3 are as follows:

1. SPARC M5 supports 4 counters (PICs) per virtual processor instead of two.

2. Each PIC is controlled via a dedicated PCR. Each PCR controls only one PIC.

3. The format of the PCR has changed significantly.

4. Access to the PCRs is via hyperprivileged ASIs only, instead of ASRs. The hypervisor can then permit privileged and user access only to the PICs via PCR.picnht and PCR.picnpt, respectively. The PCRs can thus be allocated to hypervisor, supervisor, or user code in any combination. This also enables virtualization of performance counter and measurement infrastructure to ease future development as processor architecture evolves.

5. Access to the PICs is via non-privileged ASIs only, instead of ASRs. Access is only granted based upon the settings of PCR.picnht and PCR.picnpt as described above.

6. The pic_overflow trap no longer exists. Instead, a PIC which overflows due to a precise performance event generates a precise_performance_event trap, and a PIC which overflows due to an asynchronous performance event generates a disrupting_performance_event trap. Neither trap sets SOFTINT{15}.

7. Precise performance counter overflows have no skid.

10.2 SPARC Performance Control Registers

Each virtual processor has four hyperprivileged, read/write Performance Control registers: PCR0, PCR1, PCR2, and PCR3. Each PCR controls its corresponding PIC: PCR0 controls PIC0, PCR1 controls PIC1, PCR2 controls PIC2, and PCR3 controls PIC3. Each Performance Control register contains ten fields: ntc, picnht, picnpt, sl, mask, ht, ut, st, toe, and ov. All bits except ntc and ov are always updated on a Performance Control register write. ov is a state bit associated with PIC overflow traps and is provided to allow software to determine whether a PIC counter has overflowed. ntc is also a state bit associated with PIC overflow traps that allows software to handle a special case on a precise_performance_event trap: TPC and TNPC point to the instruction which caused the overflow, but hardware already executed the instruction at TPC. In this case software must execute a DONE instead of a RETRY. ntc and ov can be reset by software but can never be written to 1. sl controls which events are counted in a PIC. mask is used in conjunction with sl to determine which set of subevents are counted in a PIC. toe controls whether a trap is generated when the PIC counter overflows. ut controls whether user-level events are counted. st controls whether supervisor-level events are counted. ht controls whether hypervisor-level events are counted.

The format of this register is shown in TABLE 10-1. Note that changing a field in the PCR does not directly affect a PIC value. To reliably change the events being monitored, software should perform the following sequence:

1. Disable counting by writing zeroes to PCR.sl and clearing PCR.ut, PCR.ht, and PCR.st.

2. Reset the PIC.

3. Enable the new event by writing a non-zero value to PCR.sl and setting PCR.ut, PCR.ht, or PCR.st, as appropriate.

TABLE 10-1 Performance Control Registers – PCR0-3 (ASI 64₁₆, VA 00₁₆, 08₁₆, 10₁₆, 18₁₆)

| Bit | Field | Initial Value | R/W | Description |
|---|---|---|---|---|
| 18 | ntc | 0 | RW | Set to 1 when PIC wraps from 2^32 − 1 to 0 on a next-to-commit (ntc) instruction (see footnote 1). Once set, ntc remains set until reset by software. Hardware sets ntc whenever it sets ov on a next-to-commit instruction. |
| 17 | picnht | 0 | RW | PIC non-hyperprivileged trap. Privileged software can access the PIC only if picnht = 0; otherwise a privileged_action trap occurs. Non-privileged software can access the PIC only when picnht = 0 and picnpt = 0; otherwise a privileged_action trap occurs. |
| 16 | picnpt | 0 | RW | PIC non-privileged trap. Non-privileged software can access the PIC only when picnht = 0 and picnpt = 0; otherwise a privileged_action trap occurs. |
| 15:11 | sl | 0 | RW | Selects one of 32 events to be counted for PIC as per the following table. |
| 10:5 | mask | 0 | RW | Mask event for PIC as listed in TABLE 10-2. |
| 4 | ht | 0 | RW | If ht = 1, count events in hyperprivileged mode; otherwise, ignore hyperprivileged mode events. |
| 3 | st | 0 | RW | If st = 1, count events in privileged mode; otherwise, ignore privileged mode events. |
| 2 | ut | 0 | RW | If ut = 1, count events in user mode; otherwise, ignore user mode events. |
| 1 | toe | 0 | RW | Trap-on-Event: This field controls whether a precise trap (precise_performance_event) or disrupting trap (disrupting_performance_event) to hyperprivileged software occurs if the corresponding PIC counter overflows. Hardware ANDs the value of toe with ov to produce a trap. Events in certain event groups (those marked as Precise in TABLE 10-2) generate a precise trap (precise_performance_event), assuming that PCR.toe = 1 and PCR.ht = 0; TPC will contain the address of an instruction that generated the counter overflow event (see footnote 2). Events in other event groups are not directly related to the instruction stream, and an overflow for one of the asynchronous events generates a disrupting_performance_event trap; therefore, the TPC may be some number of instructions later than when the overflow event occurred. |
| 0 | ov | 0 | RW | Set to 1 when PIC wraps from 2^32 − 1 to 0. Once set, ov remains set until reset by software. |
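The bit layout in TABLE 10-1 can be captured in a small pack/unpack helper, which is convenient when building the values for the disable, reset, and enable sequence described above. This is a minimal Python sketch; `PCR_FIELDS`, `pcr_pack`, and `pcr_unpack` are hypothetical names, and the sketch only computes register images (it does not model the hyperprivileged ASI access itself or the rule that ntc and ov cannot be written to 1).

```python
# Field name -> (lowest bit position, width in bits), from TABLE 10-1.
PCR_FIELDS = {
    "ntc": (18, 1), "picnht": (17, 1), "picnpt": (16, 1),
    "sl": (11, 5), "mask": (5, 6),
    "ht": (4, 1), "st": (3, 1), "ut": (2, 1), "toe": (1, 1), "ov": (0, 1),
}

def pcr_pack(**fields: int) -> int:
    """Build a PCR image from named field values; unnamed fields are 0."""
    value = 0
    for name, field_value in fields.items():
        low, width = PCR_FIELDS[name]
        if not 0 <= field_value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
        value |= field_value << low
    return value

def pcr_unpack(value: int) -> dict:
    """Split a PCR image into its named fields."""
    return {
        name: (value >> low) & ((1 << width) - 1)
        for name, (low, width) in PCR_FIELDS.items()
    }
```

As a usage example, `pcr_pack(sl=3, mask=0x3F, ut=1, toe=1)` produces `0x1FE6`, and `pcr_pack()` produces 0, the image used in step 1 to disable counting.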


TABLE 10-2 describes the settings of the sl field. Most sl fields have a mask associated with them. Setting multiple mask bits at the same time can lead to multiple events being counted as one event. Some sl groups do not use all of the mask bits; setting unused mask bits has no effect. More details are described in TABLE 10-2.

10.3 SPARC Performance Instrumentation Counter

Each virtual processor has four Performance Instrumentation Counter registers: PIC0, PIC1, PIC2, and PIC3. PCR0 controls PIC0, PCR1 controls PIC1, PCR2 controls PIC2, and PCR3 controls PIC3. Access privilege is controlled by the settings of PCR.picnht and PCR.picnpt. When PCR.picnht = 1, an attempt to access a PIC register in privileged or nonprivileged mode will cause a privileged_action trap. When PCR.picnpt = 1, an attempt to access this register in nonprivileged mode causes a privileged_action trap.
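The access rules for a PIC can be written as a short decision function. In this Python sketch the function name and mode strings are hypothetical, and treating hyperprivileged accesses as always permitted is an assumption inferred from the field name "non-hyperprivileged trap"; the text above only states the privileged and nonprivileged cases.

```python
def pic_access(mode: str, picnht: int, picnpt: int) -> str:
    """Return "ok" or "privileged_action" for an access to a PIC.

    mode is one of "hyperprivileged", "privileged", "nonprivileged".
    """
    if mode not in ("hyperprivileged", "privileged", "nonprivileged"):
        raise ValueError("unknown mode")
    if mode == "hyperprivileged":
        return "ok"                     # assumption: never blocked by picnht/picnpt
    if picnht:
        return "privileged_action"      # blocks privileged and nonprivileged
    if mode == "nonprivileged" and picnpt:
        return "privileged_action"      # blocks nonprivileged only
    return "ok"
```

With the initial PCR value (picnht = 0, picnpt = 0), all three modes can access the PIC in this model.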

The PIC counter contains a single 32-bit counter field. The field counts the event selected by PCR.sl. The ut, st, and ht fields for PCR control which combination of user, supervisor, and/or hypervisor events are counted.

Performance counter overflows a) set PCR.ov, and b) generate a hyperprivileged trap if PCR.toe is set. Which trap is generated depends upon whether the event being counted is synchronous or asynchronous, as denoted in TABLE 10-2 above. If the event is asynchronous, a disrupting_performance_event trap is generated; otherwise, a precise_performance_event trap is generated. For precise traps, the instruction that caused the overflow will not have been executed, and the PC and NPC of the instruction will be captured on the trap stack, with the following caveat. The precise_performance_event trap is delivered precisely to hypervisor.
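The overflow behavior can be sketched as a one-event step function, together with the DONE/RETRY choice that the ntc bit implies for the trap handler. This Python model is illustrative only: `count_event` and `handler_exit` are hypothetical names, the trap condition follows the "toe AND ov" rule from TABLE 10-1, and effects such as PCR.ht gating are not modeled.

```python
MASK32 = 0xFFFF_FFFF

def count_event(pic: int, ov: int, toe: int, precise: bool):
    """Count one event; return (new_pic, new_ov, trap_name_or_None)."""
    pic = (pic + 1) & MASK32
    if pic == 0:                 # wrapped from 2**32 - 1 to 0
        ov = 1
    trap = None
    if toe and ov:               # hardware ANDs toe with ov
        trap = ("precise_performance_event" if precise
                else "disrupting_performance_event")
    return pic, ov, trap

def handler_exit(ntc: int) -> str:
    """Pick the return instruction for a precise_performance_event handler."""
    # ntc = 1: the instruction at TPC already executed, so skip it with DONE.
    return "DONE" if ntc else "RETRY"
```

A common use of this behavior is sampling: preloading the counter with 2^32 − n makes the overflow occur after n events.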

The format of the PIC registers is shown in TABLE 10-2.

1. The following instructions are next-to-commit instructions: MD5, SHA1, SHA256, SHA512, MPMUL, MONTMUL, MONTSQR, loads and stores to I/O space, CAS{X}A, LDSTUB, SWAP, WRHPR, WRASR, WRPR, RDHPR, RDPR, RDASR instructions, and any non-translating load or store alternate instruction as defined in Table 9-3, “UltraSPARC {YF(VT40)} ASI Usage,” on page 163. When hardware takes a precise_performance_event trap on a next-to-commit instruction, the instruction has already been executed. Therefore, trap handler software should execute a DONE instruction; it must not execute a RETRY instruction. Software can examine the ntc bit to determine whether to execute a DONE or a RETRY instruction.

2.

TABLE 10-2 Performance Instrumentation Counter Register – PIC0-3 (ASI B0₁₆, VA 00₁₆, 08₁₆, 10₁₆, 18₁₆)

| Bit | Field | Initial Value | R/W | Description |
|---|---|---|---|---|
| 63:32 | — | 0 | RW | Reserved |
| 31:0 | counter | 0 | RW | Programmable event counter, event controlled by PCR.sl. |


CHAPTER 11

Implementation Dependencies

11.1 SPARC V9 General Information

11.1.1 Level-2 Compliance (Impdep #1)

SPARC M5 is designed to meet Level-2 SPARC V9 compliance. It

■ Correctly interprets all nonprivileged operations, and

■ Correctly interprets all privileged elements of the architecture.

11.1.2 Unimplemented Opcodes, ASIs, and ILLTRAPSPARC V9 unimplemented, reserved, ILLTRAP opcodes, and instructions with invalid values inreserved fields (other than reserved FPops) encountered during execution cause an illegal_instruction trap.Unimplemented and reserved ASI values cause a DAE_invalid_ASI trap.

11.1.3 Trap Levels (Impdep #37, 38, 39, 40, 114, 115)SPARC M5 supports two trap levels; that is, MAXPTL = 2. Normal execution is at TL = 0.

A virtual processor normally executes at trap level 0 (execute_state, TL = 0). Per SPARC V9, a trapcauses the virtual processor to enter the next higher trap level, which is a very fast and efficientprocess because there is one set of trap state registers for each trap level. After saving the mostimportant machine states (PC, NPC, PSTATE) on the trap stack at this level, the trap (or error)condition is processed.

11.1.4 Trap Handling (Impdep #16, 32, 33, 35, 36, 44)SPARC M5 supports precise trap handling for all operations except for deferred and disrupting trapsfrom hardware failures and interrupts. SPARC M5 implements precise traps, interrupts, andexceptions for all instructions, including long-latency floating-point operations. Multiple traps levelsare supported, allowing graceful recovery from faults. SPARC M5 can efficiently execute kernel codeeven in the event of multiple nested traps, promoting strand efficiency while dramatically reducingthe system overhead needed for trap handling.

Note: System emulation routines (for example, quad-precision floating-point operations) shipped with SPARC M5 also must be Level-2 compliant.


Multiple sets of global registers are provided. This further increases OS performance, providing fast trap execution by avoiding the need to save and restore registers while processing exceptions.

All traps supported in SPARC M5 are listed in TABLE 6-2 on page 49.

11.1.5 Secure Software

To establish an enhanced security environment, it may be necessary to initialize certain virtual processor states between contexts. Examples of such states are the contents of integer and floating-point register files, condition codes, and state registers. See also Clean Window Handling (Impdep #102).

11.1.6 Address Masking (Impdep #125)

SPARC M5 follows Oracle SPARC Architecture 2011 for PSTATE.am masking. Addresses to non-translating ASIs, *REAL* ASIs, and accesses that bypass translation are never masked.

11.2 SPARC V9 Integer Operations

11.2.1 Integer Register File and Window Control Registers (Impdep #2)

SPARC M5 implements an eight-window 64-bit integer register file; that is, N_REG_WINDOWS = 8. SPARC M5 truncates values stored in the CWP, CANSAVE, CANRESTORE, CLEANWIN, and OTHERWIN registers to three bits. This includes implicit updates to these registers by SAVE, SAVED, RESTORE, and RESTORED instructions. The most significant two bits of these registers read as zero.
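The three-bit truncation described above amounts to modulo-8 arithmetic on the window registers. The following Python sketch illustrates that reading; the function names are hypothetical, and the increment-on-SAVE and decrement-on-RESTORE direction is the general SPARC V9 behavior rather than something stated in this section.

```python
N_REG_WINDOWS = 8                # from the text: eight register windows
WINDOW_MASK = N_REG_WINDOWS - 1  # keeps only the low three bits

def truncate_window_reg(value: int) -> int:
    """Model the 3-bit truncation applied to CWP, CANSAVE, CANRESTORE, etc."""
    return value & WINDOW_MASK

def cwp_after_save(cwp: int) -> int:
    """CWP after a SAVE: incremented (SPARC V9 convention), then truncated."""
    return truncate_window_reg(cwp + 1)

def cwp_after_restore(cwp: int) -> int:
    """CWP after a RESTORE: decremented (SPARC V9 convention), then truncated."""
    return truncate_window_reg(cwp - 1)
```

Under this model a 5-bit value such as 0b11010 written to one of these registers would be kept as 0b010, and CWP wraps from 7 back to 0 on SAVE.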

11.2.2 Clean Window Handling (Impdep #102)

SPARC V9 introduced the concept of “clean window” to enhance security and integrity during program execution. A clean window is defined to be a register window that contains either all zeroes or addresses and data that belong to the current context. The CLEANWIN register records the number of available clean windows.

When a SAVE instruction requests a window and there are no more clean windows, a clean_window trap is generated. System software needs to clean one or more windows before returning to the requesting context.

11.2.3 Integer Multiply and Divide

Integer multiplications (MULScc, SMUL{cc}, MULX) and divisions (SDIV{cc}, UDIV{cc}, UDIVX) are executed directly in hardware.


11.2.4 MULScc

SPARC V9 does not define the value of xcc and rd{63:32} for MULScc. SPARC M5 sets xcc.n to 0, xcc.z to 1 if rd{63:0} is zero and to 0 if rd{63:0} is not zero, xcc.v to 0, and xcc.c to 0. SPARC M5 sets rd{63:33} to zeros, and sets rd{32} to icc.c (that is, rd{32} is set if there is a carry-out of rd{31}; otherwise, it is cleared).
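As a rough illustration of the rules above, the sketch below builds the 64-bit rd value and the xcc bits from the 32-bit MULScc result and its carry-out. The function name and the dictionary representation of xcc are hypothetical; the sketch does not model the multiply step itself.

```python
def mulscc_m5_results(result32: int, carry_out: int):
    """Return (rd, xcc) following the SPARC M5 rules described for MULScc.

    result32  : the 32-bit value placed in rd{31:0}
    carry_out : icc.c, the carry-out of rd{31} (0 or 1)
    """
    # rd{63:33} are zero, rd{32} mirrors icc.c, rd{31:0} holds the result.
    rd = ((carry_out & 1) << 32) | (result32 & 0xFFFF_FFFF)
    # xcc.n, xcc.v and xcc.c are always 0; xcc.z reflects the full rd{63:0}.
    xcc = {"n": 0, "z": 1 if rd == 0 else 0, "v": 0, "c": 0}
    return rd, xcc
```

Note that a zero 32-bit result with a carry-out leaves xcc.z clear, because rd{32} is then set and rd{63:0} is nonzero.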

11.3 SPARC V9 Floating-Point Operations

11.3.1 Overflow, Underflow, and Inexact Traps (Impdep #3, 55)

SPARC M5 implements precise floating-point exception handling. Tininess, as it pertains to underflow, is detected before rounding.

11.3.2 Quad-Precision Floating-Point Operations (Impdep #3)

All quad-precision floating-point instructions, listed in TABLE 11-1, cause an illegal_instruction trap. These operations are then emulated by system software.

TABLE 11-1 Unimplemented Quad-Precision Floating-Point Instructions

Instruction Description

F<s|d>TOq Convert single-/double- to quad-precision floating-point.

F<i|x>TOq Convert 32-/64-bit integer to quad-precision floating-point.

FqTO<s|d> Convert quad- to single-/double-precision floating-point.

FqTO<i|x> Convert quad-precision floating-point to 32-/64-bit integer.

FCMP<E>q Quad-precision floating-point compares.

FMOVq Quad-precision floating-point move.

FMOVqcc Quad-precision floating-point move if condition is satisfied.

FMOVqr Quad-precision floating-point move if register match condition.

FABSq Quad-precision floating-point absolute value.

FADDq Quad-precision floating-point addition.

FDIVq Quad-precision floating-point division.

FdMULq Double- to quad-precision floating-point multiply.

FMULq Quad-precision floating-point multiply.

FNEGq Quad-precision floating-point negation.

FSQRTq Quad-precision floating-point square root.

FSUBq Quad-precision floating-point subtraction.


11.3.3 Floating-Point Upper and Lower Dirty Bits in FPRS Register

The FPRS_dirty_upper (du) and FPRS_dirty_lower (dl) bits in the Floating-Point Registers State (FPRS) register are set when an instruction that modifies the corresponding upper or lower half of the floating-point register file is issued. Instructions that modify the floating-point register file include floating-point operate, graphics, floating-point load, and block load instructions.

SPARC V9 allows FPRS.du and FPRS.dl to be set pessimistically. SPARC M5 sets FPRS.du or FPRS.dl either when an instruction that updates the floating-point register file successfully completes, or when an FMOVcc or FMOVr that does not meet the condition successfully completes.

11.3.4 Floating-Point Status Register (FSR) (Impdep #13, 19, 22, 23, 24)

SPARC M5 supports precise traps and implements all three exception fields (tem, cexc, and aexc) conforming to IEEE Standard 754-1985.

SPARC M5 implements the FSR register according to the definition in Oracle SPARC Architecture 2011, with the following implementation-specific clarifications:

■ SPARC M5 does not contain an FQ; therefore, FSR.qne always reads as 0 and an attempt to read the FQ with an RDPR instruction causes an illegal_instruction trap.

■ SPARC M5 does not detect the unimplemented_FPop, unfinished_FPop, sequence_error, hardware_error, or invalid_fp_register floating-point trap types directly in hardware, and therefore does not generate a trap when those conditions occur.

TABLE 11-2 documents the fields of the FSR.
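The bit positions listed in TABLE 11-2 can be turned into a small field decoder. The following is a minimal sketch in Python; the names `FSR_FIELDS`, `ROUNDING`, and `decode_fsr` are illustrative only and are not part of any SPARC software interface.

```python
# (high bit, low bit) for each FSR field, taken from TABLE 11-2.
FSR_FIELDS = {
    "fcc3": (37, 36), "fcc2": (35, 34), "fcc1": (33, 32),
    "rd": (31, 30), "tem": (27, 23), "ns": (22, 22),
    "ver": (19, 17), "ftt": (16, 14), "qne": (13, 13),
    "fcc0": (11, 10), "aexc": (9, 5), "cexc": (4, 0),
}

# Rounding-direction encoding from the rd sub-table of TABLE 11-2.
ROUNDING = {0: "nearest (even if tie)", 1: "toward 0",
            2: "toward +infinity", 3: "toward -infinity"}

def decode_fsr(fsr: int) -> dict:
    """Split a 64-bit FSR value into its named fields."""
    fields = {}
    for name, (hi, lo) in FSR_FIELDS.items():
        width = hi - lo + 1
        fields[name] = (fsr >> lo) & ((1 << width) - 1)
    return fields
```

For example, `ROUNDING[decode_fsr(value)["rd"]]` gives a readable rounding direction for a raw FSR value.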

TABLE 11-2 Floating-Point Status Register Format

Bits   Field   RW   Description

37:36   fcc3   RW   Floating-point condition code (set 3). One of four sets of 2-bit floating-point condition codes, which are modified by the FCMP{E} (and LD{X}FSR) instructions. The FBfcc, FMOVcc, and MOVcc instructions use one of these condition code sets to determine conditional control transfers and conditional register moves. Note: fcc0 is the same as the FCC in SPARC V8.

35:34   fcc2   RW   Floating-point condition code (set 2). See fcc3 description.

33:32   fcc1   RW   Floating-point condition code (set 1). See fcc3 description.

31:30   rd   RW   IEEE Std. 754-1985 rounding direction, as follows:

rd   Round Toward

0   Nearest (even if tie)

1   0

2   +∞

3   –∞

27:23   tem   RW   IEEE-754 trap enable mask. Five-bit trap enable mask for the IEEE-754 floating-point exceptions. If a floating-point operate instruction produces one or more exceptions, the corresponding cexc/aexc bits are set and an fp_exception_ieee_754 (with FSR.ftt = 1, IEEE_754_exception) exception is generated.

22   ns   RO   Nonstandard floating-point results. SPARC M5 does not implement a non-standard floating-point mode. FSR.ns always reads as 0, and writes to it are ignored.

19:17   ver   RO   FPU version number. This field identifies a particular implementation of the SPARC M5 FPU architecture.

16:14   ftt   RW   Floating-point trap type. Set whenever a floating-point instruction causes the fp_exception_ieee_754 or fp_exception_other traps. Values are as follows:

ftt   Floating-Point Trap Type   Trap Signalled

0   None   —

1   IEEE_754_exception   fp_exception_ieee_754

2   reserved   —

3   reserved   —

4   reserved   —

5   reserved   —

6   invalid_fp_register   fp_exception_other

7   reserved   —

Note: SPARC M5 neither detects nor generates the unimplemented_FPop, unfinished_FPop, sequence_error, hardware_error or invalid_fp_register trap types directly in hardware.

Note: SPARC M5 does not contain an FQ. An attempt to read the FQ with an RDPR instruction causes an illegal_instruction trap.

13   qne   RW   Floating-point deferred-trap queue (FQ) not empty. Not used, because SPARC M5 implements precise floating-point exceptions.

12   —   RO   Reserved

11:10   fcc0   RW   Floating-point condition code (set 0). See fcc3 description.

9:5   aexc   RW   Accumulated outstanding exceptions. Accumulates IEEE 754 exceptions while floating-point exception traps are disabled (that is, while the corresponding bit in FSR.tem is zero).

4:0   cexc   RW   Current outstanding exceptions. Indicates the most recently generated IEEE 754 exceptions.

11.4 SPARC V9 Memory-Related Operations

11.4.1 Load/Store Alternate Address Space (Impdep #5, 29, 30)

Supported ASI accesses are listed in Section 9.3.

11.4.2 Read/Write ASR (Impdep #6, 7, 8, 9, 47, 48)

Supported ASRs are listed in Chapter 3, Registers.


11.4.3 MMU Implementation (Impdep #41)

SPARC M5 memory management is based on in-memory Translation Storage Buffers (TSBs) backed by a Software Translation Table. See Chapter 13, Memory Management Unit for more details.

11.4.4 FLUSH and Self-Modifying Code (Impdep #122)

FLUSH is needed to synchronize code and data spaces after code space is modified during program execution. FLUSH is described in Section D.3.4. On SPARC M5, the FLUSH effective address is ignored, and as a result, FLUSH cannot cause a DAE_invalid_ASI trap.

Note: SPARC V9 specifies that the FLUSH instruction has no latency on the issuing virtual processor. In other words, a store to instruction space prior to the FLUSH instruction is visible immediately after the completion of FLUSH. When a flush is performed, SPARC M5 guarantees that earlier code modifications will be visible across the whole system.

11.4.5 PREFETCH{A} (Impdep #103, 117)

For SPARC M5, PREFETCH{A} instructions follow TABLE 11-3 based on the fcn value. See Section 5.3, PREFETCH/PREFETCHA, on page 35 for more detail.

11.4.6 LDD/STD Handling (Impdep #107, 108)

LDD and STD instructions are directly executed in hardware.

TABLE 11-3 PREFETCH{A} Variants in SPARC M5

fcn   Prefetch Function   Action

0₁₆   Weak prefetch for several reads   Prefetch to L1 data cache and Level 2 cache.

1₁₆   Weak prefetch for one read   Prefetch to L2 cache.

2₁₆   Weak prefetch for several writes   Prefetch to L2 cache (exclusive)

3₁₆   Weak prefetch for one write   Prefetch to L2 cache (exclusive)

4₁₆   Prefetch Page   No operation.

5₁₆–F₁₆   —   Illegal_instruction trap.

10₁₆   NOP   NOP - no action taken.

11₁₆   Strong prefetch to nearest unified cache   Prefetch into Level 2 cache.

12₁₆–13₁₆   NOP   NOP - no action taken.

14₁₆   Strong prefetch for several reads   Prefetch to L1 data cache and Level 2 cache.

15₁₆   Strong prefetch for one read   Prefetch to L2 cache.

16₁₆   Strong prefetch for several writes   Prefetch to L2 cache (exclusive)

17₁₆   Strong prefetch for one write   Prefetch to L2 cache (exclusive)

18₁₆–1F₁₆   NOP   No operation
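The fcn-to-action mapping in TABLE 11-3 can be summarized as a lookup function. This is only a sketch of the table contents; the function name and the returned strings are hypothetical labels, not hardware signals.

```python
def prefetch_action(fcn: int) -> str:
    """Return the action TABLE 11-3 lists for a 5-bit PREFETCH{A} fcn value."""
    if not 0 <= fcn <= 0x1F:
        raise ValueError("fcn is a 5-bit field")
    if fcn in (0x00, 0x14):
        return "L1 data cache and L2 cache"
    if fcn in (0x01, 0x11, 0x15):
        return "L2 cache"
    if fcn in (0x02, 0x03, 0x16, 0x17):
        return "L2 cache (exclusive)"
    if 0x05 <= fcn <= 0x0F:
        return "illegal_instruction trap"
    # Remaining values: 0x04, 0x10, 0x12-0x13, 0x18-0x1F.
    return "no operation"
```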


11.4.7 FP mem_address_not_aligned (Impdep #109, 110, 111, 112)

LDDF{A}/STDF{A} cause an LDDF_/STDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned.

LDQF{A}/STQF{A} are not directly executed in hardware; they cause an illegal_instruction trap.
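The LDDF/STDF alignment rule above can be expressed as a short check. In the sketch below, the function name is hypothetical, and the fall-through to a plain mem_address_not_aligned trap for addresses that are not even 32-bit aligned is an assumption based on general SPARC V9 behavior, not on this section.

```python
def fp_doubleword_alignment_trap(ea: int, is_store: bool = False):
    """Return the alignment trap name for an LDDF/STDF effective address, or None."""
    if ea % 8 == 0:
        return None  # doubleword aligned: no alignment trap
    if ea % 4 == 0:
        # 32-bit aligned but not 64-bit aligned: the case described in the text.
        return ("STDF_mem_address_not_aligned" if is_store
                else "LDDF_mem_address_not_aligned")
    # Assumption: any other misalignment raises the generic trap.
    return "mem_address_not_aligned"
```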

11.4.8 Supported Memory Models (Impdep #113, 121)

SPARC M5 supports only the TSO memory model, although certain specific operations such as block loads and stores operate under the RMO memory model. See Section 8.2, Supported Memory Models.

11.4.9 Implicit ASI When TL > 0 (Impdep #124)

SPARC M5 matches all Oracle SPARC Architecture implementations and makes the implicit ASI for instruction fetching ASI_NUCLEUS when TL > 0, while the implicit ASI for loads and stores when TL > 0 is ASI_NUCLEUS if PSTATE.cle = 0 or ASI_NUCLEUS_LITTLE if PSTATE.cle = 1.
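The selection rule above can be sketched as a small function. The ASI numbers are those listed in TABLE 13-9; the function name is hypothetical, and the TL = 0 branch (ASI_PRIMARY or ASI_PRIMARY_LITTLE) reflects general SPARC V9 behavior and is included as an assumption, since this section only covers TL > 0.

```python
ASI_NUCLEUS = 0x04
ASI_NUCLEUS_LITTLE = 0x0C
ASI_PRIMARY = 0x80
ASI_PRIMARY_LITTLE = 0x88

def implicit_asi(tl: int, cle: int, is_fetch: bool) -> int:
    """Return the implicit ASI for an instruction fetch or a load/store."""
    if tl > 0:
        if is_fetch:
            return ASI_NUCLEUS  # fetches ignore PSTATE.cle
        return ASI_NUCLEUS_LITTLE if cle else ASI_NUCLEUS
    # Assumption (general SPARC V9 behavior, not stated in this section).
    if is_fetch:
        return ASI_PRIMARY
    return ASI_PRIMARY_LITTLE if cle else ASI_PRIMARY
```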

11.5 Non-SPARC V9 Extensions

11.5.1 Cache Subsystem

SPARC M5 contains one or more levels of cache. The cache subsystem architecture is described in Appendix D, Cache Coherency and Ordering.

11.5.2 Block Memory Operations

SPARC M5 supports 64-byte block memory operations utilizing a block of eight double-precision floating-point registers as a temporary buffer. See Section 5.5.

11.5.3 Partial Stores

SPARC M5 supports 8-/16-/32-bit partial stores to memory. See Section 5.5.

11.5.4 Short Floating-Point Loads and Stores

SPARC M5 supports 8-/16-bit loads and stores to the floating-point registers.

Note: LDD/STD are deprecated in SPARC V9. In SPARC M5 it is more efficient to use LDX/STX for accessing 64-bit data. LDD/STD take longer to execute than two 32- or 64-bit loads/stores.


11.5.5 Load Twin Extended Word

SPARC M5 supports 128-bit atomic load operations to a pair of integer registers.

11.5.6 SPARC M5 Instruction Set Extensions (Impdep #106)

The SPARC M5 processor supports VIS 3.0. VIS instructions are designed to enhance graphics functionality and improve the efficiency of memory accesses.

Unimplemented IMPDEP1 and IMPDEP2 opcodes encountered during execution cause an illegal_instruction trap.

Other instruction extensions are described in Chapter 3, Registers.

11.5.7 Performance Instrumentation

SPARC M5 performance instrumentation is described in Chapter 10, Performance Instrumentation.


CHAPTER 12

Cryptographic Extensions

Oracle SPARC Architecture CMT processors have always provided hardware support for a range of cryptographic operations. UltraSPARC T1, UltraSPARC T2, and SPARC T3 contained discrete, hyperprivileged, per-core accelerators. However, the software overheads associated with using these accelerators can be somewhat problematic for small cryptographic operations.

SPARC M5 dispenses with per-core discrete cryptographic accelerators, and provides cryptographic support via non-privileged instructions. The instructions accelerate bulk ciphers, secure hashes, and public-key algorithms. Since these instructions are non-privileged, they can be used directly by applications, or by commonly used open source cryptographic libraries such as OpenSSL. In doing so, SPARC M5 eliminates software overhead associated with discrete cryptographic accelerators.

In SPARC M5, symmetric ciphers are implemented such that a single instruction is capable of performing a significant portion of a round. Secure hashes are implemented such that a single instruction performs a single block of the hash operation (that is, multiple rounds). Public-key operations are accelerated via instructions that perform large (up to 2048-bit) Montgomery multiplication operations. More details on these instructions can be found in Chapter 5, Instruction Definitions.

The SPARC M5 cryptographic extensions have been designed such that future UltraSPARC processors can drop support for older, deprecated ciphers (and introduce support for new ones) by reclaiming opcodes previously reserved for old ciphers. This is achieved by the introduction of the Compatibility Feature Register (CFR).

12.1 CFR Register

The CFR is described in Chapter 3, Registers.

12.2 Cryptographic Instructions

SPARC M5 introduces a number of new cryptographic opcodes, which are detailed in Chapter 5, Instruction Definitions.

12.3 Cryptographic Performance

For a single thread executing on a core, the basic low-level performance on SPARC M5 is detailed in TABLE 12-1 through TABLE 12-3.


12.4 Core S3 Crypto Coding Guidance

It is anticipated that the SPARC M5 cryptographic instructions will be widely deployed, not only in Solaris libraries but also in open source libraries like OpenSSL. Implementation of key cryptographic algorithms using these instructions is very straightforward, and example use is provided in the instructions chapter. It is important that software use the CFR as detailed in Section 3.2.8, Compatibility Feature Register (CFR), on page 21, or software may perform sub-optimally on future processors.

TABLE 12-1 Symmetric-key performance

Algorithm Block Size (Bytes) Block Latency (Cycles)

DES-ECB 8

3DES-ECB 8

AES-128-ECB 16

AES-192-ECB 16

AES-256-ECB 16

Kasumi

Camellia

TABLE 12-2 Secure hash performance

Algorithm Block Size (Bytes) Block Latency (Cycles)

MD5 64 186

SHA-1 64 220

SHA-256 64 188

SHA-512 128 236
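The block latencies in TABLE 12-2 translate directly into cycles per byte. The sketch below does that arithmetic; it assumes blocks are hashed back to back with no overlap between them, and the clock frequency passed to `throughput_mb_per_s` is a caller-supplied, hypothetical value (this chapter does not state one).

```python
# (block size in bytes, block latency in cycles) from TABLE 12-2.
HASH_BLOCKS = {
    "MD5": (64, 186),
    "SHA-1": (64, 220),
    "SHA-256": (64, 188),
    "SHA-512": (128, 236),
}

def cycles_per_byte(algorithm: str) -> float:
    """Block latency divided by block size for one hash algorithm."""
    size, latency = HASH_BLOCKS[algorithm]
    return latency / size

def throughput_mb_per_s(algorithm: str, clock_hz: float) -> float:
    """Single-thread estimate in MB/s (10**6 bytes) at an assumed clock rate."""
    return clock_hz / cycles_per_byte(algorithm) / 1e6
```

On these figures SHA-512 costs fewer cycles per byte than SHA-256 (236/128 versus 188/64) because its block is twice as large.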

TABLE 12-3 Public-key performance

Algorithm Operation Latency (cycles)

RSA1024(sign) TBD

RSA2048(sign) TBD

Compatibility Note

(-- need to fill in with text about compatibility with legacy software, or expected compatibility going forward ... use of the CFR register, etc ---)


CHAPTER 13

Memory Management Unit

This chapter provides detailed information about the SPARC M5 Memory Management Unit. It describes the internal architecture of the MMU and how to program it.

13.1 Translation Table Entry (TTE)

The Translation Table Entry holds information for a single page mapping. The TTE is broken into two 64-bit words, representing the tag and data of the translation. Just as in a hardware cache, the tag is used to determine whether there is a hit in the TSB.

TABLE 13-1 shows the Oracle SPARC Architecture 2011 TTE tag format, modified to support 5 page sizes, as interpreted by SPARC M5.

The sun4v TTE data format is shown in TABLE 13-2.

TABLE 13-1 TTE Tag Format

Bit Field Description

63:48 context The 16-bit context identifier associated with the TTE.

47:42 0 Must be 0

41:0 va Virtual Address Tag{63:22}. The virtual page number. Bits 21 through 13 are not maintained in the tag, since these bits are used to index the smallest TSB (512 entries). NOTE: SPARC Core S3 hardware only supports a 52-bit VA.
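The tag layout in TABLE 13-1 can be packed and unpacked with a few shifts. This is a minimal sketch; `make_tte_tag` and `split_tte_tag` are hypothetical helper names, not part of any sun4v interface.

```python
VA_TAG_BITS = 42  # tag bits 41:0 hold VA{63:22}

def make_tte_tag(context: int, va: int) -> int:
    """Build a 64-bit TTE tag from a 16-bit context and a virtual address."""
    if not 0 <= context < (1 << 16):
        raise ValueError("context is a 16-bit field")
    va_tag = (va & 0xFFFF_FFFF_FFFF_FFFF) >> 22  # keep VA{63:22}
    return (context << 48) | va_tag              # bits 47:42 stay zero

def split_tte_tag(tag: int) -> dict:
    """Split a TTE tag into the fields listed in TABLE 13-1."""
    return {
        "context": (tag >> 48) & 0xFFFF,
        "must_be_zero": (tag >> 42) & 0x3F,
        "va_63_22": tag & ((1 << VA_TAG_BITS) - 1),
    }
```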

TABLE 13-2 TTE Data Format

Bit   Field   Description

63   v   Valid. If the Valid bit is set, the remaining fields of the TTE are meaningful.

62   nfo   No-fault-only. If this bit is set, loads with ASI_PRIMARY_NO_FAULT{_LITTLE}, ASI_SECONDARY_NO_FAULT{_LITTLE} are translated. Any other DMMU access will trap with a DAE_nfo_page trap. For the IMMU, if the nfo bit is set, an IAE_nfo_page trap will be taken.

61:56   soft2   soft2 and soft are software-defined fields, provided for use by the operating system. Software fields are not implemented in the SPARC M5 TLB. soft and soft2 fields may be written with any value; they read from the TLB as zero, with the exception of soft{61}, which contains the TLB data parity bit.

55:13   ra   The real page number (see footnote 1). For SPARC M5, a 48-bit real address range is supported by the hardware tablewalker, and bits {55:48} should always be zero.

12   ie   Invert endianness. If this bit is set, accesses to the associated page are processed with inverse endianness from what is specified by the instruction (big-for-little and little-for-big). For the IMMU, the ie bit in the TTE is written into the ITLB but ignored during ITLB operation. The value of the ie bit written into the ITLB will be read out on an ITLB Data Access read. Note: This bit is intended to be set primarily for noncacheable accesses.

11   e   Side effect. If this bit is set, noncacheable memory accesses other than block loads and stores are strongly ordered against other e bit accesses, and noncacheable stores are not merged. This bit should be set for pages that map I/O devices having side effects. Note, however, that the e bit does not prevent normal instruction prefetching. For the IMMU, the e bit in the TTE is written into the ITLB, but ignored during ITLB operation. The value of the e bit written into the ITLB will be read out on an ITLB Data Access read. NOTE: The e bit does not force an uncacheable access. It is expected, but not required, that the cp and cv bits will be set to zero when the e bit is set.

10:9   cp, cv   The cacheable-in-physically-indexed-cache and cacheable-in-virtually-indexed-cache (cp, cv) bits determine the placement of data in SPARC M5 caches, according to TABLE 13-3. The MMU does not operate on the cacheable bits, but merely passes them through to the cache subsystem. The cv bit is ignored by SPARC M5, is not written into the TLBs, and returns zero on a Data Access read.

8   p   Privileged. If the p bit is set, only privileged software can access the page mapped by the TTE. If the p bit is set and an access to the page is attempted when PSTATE.priv = 0, the MMU will signal an IAE_privilege_violation or DAE_privilege_violation trap.

7   ep   Executable. If the ep bit is set, the page mapped by this TTE has execute permission granted. Otherwise, execute permission is not granted and the hardware table-walker will not load the ITLB with a TTE with ep = 0. For the IMMU and DMMU, the ep bit in the TTE is not written into the TLB. It returns one on a Data Access read for the ITLB and zero on a Data Access read for the DTLB.

6   w   Writable. If the w bit is set, the page mapped by this TTE has write permission granted. Otherwise, write permission is not granted and the MMU will cause a trap if a write is attempted. For the IMMU, the w bit in the TTE is written into the ITLB, but ignored during ITLB operation. The value of the w bit written into the ITLB will be read out on an ITLB Data Access read.

5:4   soft   (see soft2, above)

3:0   size   The page size of this entry, encoded as shown in TABLE 13-4.

1. sun4v supports translation from virtual addresses (VA) to real addresses (RA) to physical addresses (PA). Privileged code manages the VA-to-RA translations.

TABLE 13-3 Cacheable Field Encoding (from TSB)

Cacheable (cp:cv)   Meaning of TTE When Placed in iTLB (I-cache PA-Indexed)   Meaning of TTE When Placed in dTLB (D-cache PA-Indexed)

0x   Cacheable in L2 and L3 caches only   Cacheable in L2 and L3 caches only

1x   Cacheable in L3 cache, L2 cache, and I-cache   Cacheable in L3 cache, L2 cache, and D-cache

TABLE 13-4 Size Field Encoding (from TTE)

Size{3:0}   Page Size

0000   8 KB

0001   64 KB

0010   Reserved

0011   4 MB

0100   Reserved

0101   256 MB

0110   2 GB

0111-1111   Reserved

13.2 Translation Storage Buffer (TSB)

A TSB is an array of TTEs managed entirely by software. It serves as a cache of the Software Translation Table.

A TSB is arranged as a direct-mapped cache of TTEs.

The TSB exists as a normal data structure in memory and therefore may be cached. This policy may result in some conflicts with normal instruction and data accesses, but the dynamic sharing of the level-2 cache resource should provide a better overall solution than that provided by a fixed partitioning.

FIGURE 13-1 shows the TSB organization. The constant N is determined by the size field in the TSB register; it may range from 512 entries to 16 M entries.

FIGURE 13-1 TSB Organization (N lines in the TSB, from Tag1/Data1 to TagN/DataN; each line holds an 8-byte tag at offset 0000₁₆ followed by 8 bytes of data at offset 0008₁₆)
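The TTE data layout (TABLE 13-2), the size encoding (TABLE 13-4), and the 16-byte TSB line format can be combined into a short decoding sketch. The helper names are hypothetical. The index computation assumes 8 KB pages and that a TSB of N entries is indexed by the low log2(N) bits of the virtual page number; the text only states this for the smallest TSB (VA{21:13}, 512 entries), so the generalization is an assumption.

```python
# Size{3:0} encodings from TABLE 13-4 (reserved encodings are omitted).
PAGE_SIZES = {0b0000: 8 << 10, 0b0001: 64 << 10, 0b0011: 4 << 20,
              0b0101: 256 << 20, 0b0110: 2 << 30}
TSB_LINE_BYTES = 16  # 8-byte tag followed by 8-byte data

def decode_tte_data(data: int) -> dict:
    """Extract the TABLE 13-2 fields from a 64-bit TTE data word."""
    bit = lambda n: (data >> n) & 1
    return {
        "v": bit(63), "nfo": bit(62),
        "ra_page": (data >> 13) & ((1 << 43) - 1),  # bits 55:13
        "ie": bit(12), "e": bit(11), "cp": bit(10), "cv": bit(9),
        "p": bit(8), "ep": bit(7), "w": bit(6),
        "page_size": PAGE_SIZES.get(data & 0xF),    # None if reserved
    }

def tsb_line_offset(va: int, n_entries: int = 512) -> int:
    """Byte offset of the TSB line for va, assuming 8 KB pages."""
    if not 512 <= n_entries <= (16 << 20) or n_entries & (n_entries - 1):
        raise ValueError("n_entries must be a power of two in 512..16M")
    index = (va >> 13) & (n_entries - 1)  # VA{21:13} when n_entries = 512
    return index * TSB_LINE_BYTES
```

A software lookup would read the tag at the returned offset, compare it with the tag built from the current context and VA, and use the data word 8 bytes later only if the tags match and the v bit is set.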


13.3 MMU-Related Faults and Traps

13.3.1 IAE_privilege_violation Trap

This trap occurs when the I-MMU detects a privilege violation for an instruction fetch; that is, an attempted access to a privileged page when PSTATE.priv = 0.

13.3.2 IAE_nfo_page Trap

This trap occurs when, during a hardware tablewalk, the I-MMU matches a TTE entry whose nfo (no-fault-only) bit is set.

13.3.3 DAE_privilege_violation Trap

This trap occurs when the D-MMU detects a privilege violation for a data access; that is, a load or store instruction attempts access to a privileged page when PSTATE.priv = 0.

13.3.4 DAE_side_effect_page Trap

This trap occurs when a speculative (nonfaulting) load instruction is issued to a page marked with the side-effect (e) bit = 1.

13.3.5 DAE_nc_page Trap

This trap occurs when an atomic instruction (including a 128-bit atomic load) is issued to a memory address marked uncacheable; for example, with cp = 0.

13.3.6 DAE_invalid_asi Trap

This trap occurs when an invalid LDA/STA ASI value, invalid virtual address, read to write-only register, or write to read-only register occurs, but not for an attempted user access to a restricted ASI (see the privileged_action trap described below).

13.3.7 DAE_nfo_page Trap

This trap occurs when an access occurs with an ASI other than ASI_{PRIMARY,SECONDARY}_NO_FAULT{_LITTLE} to a page marked with the nfo (no-fault-only) bit.

Implementation Note: The nfo bit is only checked on I-MMU translations. It is not checked on hardware tablewalks.

Implementation Note: For SPARC M5, cp only controls cacheability in the L1 cache, not the private L2 caches or the shared L3. SPARC M5 performs atomic operations in the L2 cache and supports the ability to complete an atomic operation for pages with the cp bit = 0 even if the L2 cache is disabled. However, to keep SPARC M5 compliant with the Oracle SPARC Architecture 2011 specification, a DAE_nc_page trap is generated when an atomic is issued to a memory address marked with cp = 0.


13.3.8 privileged_action Trap

This trap occurs when an access is attempted using a restricted ASI while in non-privileged mode (PSTATE.priv = 0).

13.3.9 *_mem_address_not_aligned Traps

The lddf_mem_address_not_aligned, stdf_mem_address_not_aligned, and mem_address_not_aligned traps occur when a load, store, atomic, or JMPL/RETURN instruction with a misaligned address is executed.

13.4 MMU Operation Summary

TABLE 13-7 summarizes the behavior of the D-MMU for noninternal ASIs using tabulated abbreviations. TABLE 13-8 summarizes the behavior of the I-MMU. In each case, and for all conditions, the behavior of the MMU is given by one of the abbreviations in TABLE 13-5. TABLE 13-6 lists abbreviations for ASI types.

Other abbreviations include “w” for the writable bit, “e” for the side-effect bit, and “p” for the privileged bit.

TABLE 13-7 and TABLE 13-8 do not cover the following cases:

■ Invalid ASIs, ASIs that have no meaning for the opcodes listed, or nonexistent ASIs; for example, ASI_PRIMARY_NO_FAULT for a store or atomic; also, access to SPARC M5 internal registers other than LDXA, LDFA, STDFA or STXA; the MMU signals a DAE_invalid_asi trap for this case.

TABLE 13-5 Abbreviations for MMU Behavior

Abbreviation Meaning

ok Normal translation

dasi DAE_invalid_asi trap

dpriv DAE_privilege_violation trap

dse DAE_side_effect_page trap

ipriv IAE_privilege_violation trap

TABLE 13-6 Abbreviations for ASI Types

Abbreviation Meaning

NUC ASI_NUCLEUS*

PRIM Any ASI with PRIMARY translation, except *NO_FAULT

SEC Any ASI with SECONDARY translation, except *NO_FAULT

PRIM_NF ASI_PRIMARY_NO_FAULT*

SEC_NF ASI_SECONDARY_NO_FAULT*

U_PRIM ASI_*_AS_IF_USER_PRIMARY*

U_SEC ASI_*_AS_IF_USER_SECONDARY*

U_PRIV ASI_*_AS_IF_PRIV_*

REAL ASI_*REAL*

Note The *_LITTLE versions of the ASIs behave the same as the big-endian versions with regard to the MMU table of operations.


■ Attempted access using a restricted ASI in nonprivileged mode; the MMU signals a privileged_action trap for this case. Attempted use of a hyperprivileged ASI in privileged mode; the MMU also signals a privileged_action trap for this case.

■ An atomic instruction (including 128-bit atomic load) issued to a memory address marked uncacheable in a physical cache (that is, with cp = 0 or pa{47} = 1); the MMU signals a DAE_nc_page trap for this case.

■ A data access with an ASI other than ASI_{PRIMARY,SECONDARY}_NO_FAULT{_LITTLE} to a page marked nfo; the MMU signals a DAE_nfo_page trap for this case.

■ An instruction access to a page marked with the nfo (no-fault-only) bit. The MMU signals an IAE_nfo_page trap for this case.

■ An instruction fetch to a memory address marked non-executable (ep = 0). This is checked when Hardware Tablewalk attempts to load the I-MMU, and an IAE_unauth_access trap is taken instead.

■ Real address out of range; the MMU signals an instruction_real_range trap for this case.

■ Virtual address out of range and PSTATE.am is not set; the MMU signals an instruction_address_range trap for this case.

See summary of the SPARC M5 ASI map.

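TABLE 13-7 D-MMU Operations for Normal ASIs

Behavior is listed in the order (e = 0, p = 0) / (e = 0, p = 1) / (e = 1, p = 0) / (e = 1, p = 1). A single value applies to all four cases.

Opcode   priv Mode   ASI   w   Behavior

Load   non-privileged   PRIM, SEC   —   ok / dpriv / ok / dpriv

Load   non-privileged   PRIM_NF, SEC_NF   —   ok / dpriv / dse / dpriv

Load   privileged   PRIM, SEC, NUC   —   ok

Load   privileged   PRIM_NF, SEC_NF   —   ok (e = 0) / dse (e = 1)

Load   privileged   U_PRIM, U_SEC   —   ok / dpriv / ok / dpriv

Load   privileged   REAL   —   ok

FLUSH   non-privileged   —   —   ok

FLUSH   privileged   —   —   ok

Store or Atomic   non-privileged   PRIM, SEC   0   dprot / dpriv / dprot / dpriv

Store or Atomic   non-privileged   PRIM, SEC   1   ok / dpriv / ok / dpriv

Store or Atomic   privileged   PRIM, SEC, NUC   0   dprot

Store or Atomic   privileged   PRIM, SEC, NUC   1   ok

Store or Atomic   privileged   U_PRIM, U_SEC   0   dprot / dpriv / dprot / dpriv

Store or Atomic   privileged   U_PRIM, U_SEC   1   ok / dpriv / ok / dpriv

Store or Atomic   privileged   REAL   0   dprot

Store or Atomic   privileged   REAL   1   ok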
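The rows of TABLE 13-7 can be reproduced by a short decision function. The sketch below is one reading of the table, not a description of the hardware: the check order (privilege before write protection and side effect) is inferred from the table entries, the function name and string arguments are hypothetical, and `dprot` is returned verbatim as it appears in TABLE 13-7. The `privileged_action` and `dasi` branches come from the list of cases the table does not cover.

```python
RESTRICTED = {"NUC", "U_PRIM", "U_SEC", "REAL"}  # ASI types needing privilege
NO_FAULT = {"PRIM_NF", "SEC_NF"}

def dmmu_behavior(opcode: str, privileged: bool, asi: str,
                  w: int = 0, e: int = 0, p: int = 0) -> str:
    """opcode is 'load', 'flush', or 'store' (store also stands for atomics)."""
    if opcode == "flush":
        return "ok"
    if not privileged and asi in RESTRICTED:
        return "privileged_action"  # restricted ASI in nonprivileged mode
    if opcode != "load" and asi in NO_FAULT:
        return "dasi"               # no-fault ASIs are meaningless for stores
    # Nonprivileged code, and privileged code using AS_IF_USER ASIs,
    # is denied access to privileged pages.
    as_user = (not privileged) or asi in ("U_PRIM", "U_SEC")
    if as_user and p == 1:
        return "dpriv"
    if opcode == "load":
        return "dse" if (asi in NO_FAULT and e == 1) else "ok"
    return "ok" if w == 1 else "dprot"
```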

TABLE 13-8 I-MMU Operations

Condition Behavior

privilege Mode P = 0 P = 1

nonprivileged ok ipriv

privileged ok


13.5 Translation

13.5.1 Instruction Translation

13.5.1.1 Instruction Prefetching

SPARC M5 fetches instructions sequentially (including delay slots). SPARC M5 fetches delay slots before the branch is resolved (that is, before it is known whether the delay slot will be annulled). SPARC M5 also fetches the target of a DCTI before the delay slot executes.

13.5.2 Data Translation

TABLE 13-9 DMMU Translation

ASI Value (hex)   ASI Name   Translation: Nonprivileged   Privileged   Hypervisor

00₁₆–03₁₆   Reserved   privileged_action   DAE_invalid_asi

04₁₆   ASI_NUCLEUS   privileged_action   VA → PA

05₁₆–0B₁₆

0C₁₆   ASI_NUCLEUS_LITTLE   privileged_action   VA → PA

0D₁₆–0F₁₆

10₁₆   ASI_AS_IF_USER_PRIMARY   privileged_action   VA → PA

11₁₆   ASI_AS_IF_USER_SECONDARY   privileged_action   VA → PA

12₁₆–13₁₆

14₁₆   ASI_REAL   privileged_action   RA → PA

15₁₆   ASI_REAL_IO   privileged_action   RA → PA

16₁₆   ASI_BLOCK_AS_IF_USER_PRIMARY   privileged_action   VA → PA

17₁₆   ASI_BLOCK_AS_IF_USER_SECONDARY   privileged_action   VA → PA

18₁₆   ASI_AS_IF_USER_PRIMARY_LITTLE   privileged_action   VA → PA

19₁₆   ASI_AS_IF_USER_SECONDARY_LITTLE

1A₁₆–1B₁₆

1C₁₆¹   ASI_REAL_LITTLE   privileged_action   RA → PA

1D₁₆   ASI_REAL_IO_LITTLE   privileged_action   RA → PA

1E₁₆   ASI_BLOCK_AS_IF_USER_PRIMARY_LITTLE

1F₁₆   ASI_BLOCK_AS_IF_USER_SECONDARY_LITTLE

20₁₆   ASI_SCRATCHPAD   privileged_action   nontranslating

21₁₆   ASI_PRIMARY_CONTEXT_0_REG, ASI_PRIMARY_CONTEXT_1_REG, ASI_SECONDARY_CONTEXT_0_REG, ASI_SECONDARY_CONTEXT_1_REG   privileged_action   nontranslating


22₁₆   ASI_TWINX_AIUP, ASI_STBI_AIUP

23₁₆   ASI_TWINX_AIUS, ASI_STBI_AIUS

24₁₆   Reserved   privileged_action   DAE_invalid_asi

25₁₆   ASI_QUEUE   privileged_action   nontranslating

26₁₆   ASI_TWINX_REAL, ASI_STBI_REAL   privileged_action   RA → PA

27₁₆   ASI_TWINX_NUCLEUS, ASI_STBI_N

28₁₆–29₁₆

2A₁₆   ASI_TWINX_AIUPL, ASI_STBI_AIUPL

2B₁₆   ASI_TWINX_AIUSL, ASI_STBI_AIUSL

2C₁₆   Reserved   privileged_action   DAE_invalid_asi

2D₁₆   Reserved   privileged_action   DAE_invalid_asi

2E₁₆   ASI_TWINX_REAL_LITTLE, ASI_STBI_REAL_LITTLE   privileged_action   RA → PA

2F₁₆   ASI_TWINX_NL, ASI_STBI_NL

80₁₆   ASI_PRIMARY   VA → PA   VA → PA

81₁₆   ASI_SECONDARY   VA → PA   VA → PA

82₁₆   ASI_PRIMARY_NO_FAULT   VA → PA   VA → PA

83₁₆   ASI_SECONDARY_NO_FAULT   VA → PA   VA → PA

84₁₆–87₁₆   Reserved   DAE_invalid_asi   DAE_invalid_asi

88₁₆   ASI_PRIMARY_LITTLE   VA → PA   VA → PA

89₁₆   ASI_SECONDARY_LITTLE   VA → PA   VA → PA

8A₁₆   ASI_PRIMARY_NO_FAULT_LITTLE   VA → PA   VA → PA

8B₁₆   ASI_SECONDARY_NO_FAULT_LITTLE   VA → PA   VA → PA

8C₁₆–AF₁₆

B0₁₆   ASI_PIC0, ASI_PIC1, ASI_PIC2, ASI_PIC3   nontranslating   nontranslating

B1₁₆–BF₁₆

C0₁₆   ASI_PST8_P   VA → PA   VA → PA

C1₁₆   ASI_PST8_S   VA → PA   VA → PA

C6₁₆–C7₁₆





C8₁₆   ASI_PST8_PL   VA → PA   VA → PA

C9₁₆   ASI_PST8_SL   VA → PA   VA → PA

CA₁₆   ASI_PST16_PL   VA → PA   VA → PA

CB₁₆   ASI_PST16_SL   VA → PA   VA → PA

CC₁₆   ASI_PST32_PL   VA → PA   VA → PA

CD₁₆   ASI_PST32_SL   VA → PA   VA → PA

CE₁₆–CF₁₆

D0₁₆   ASI_FL8_P   VA → PA   VA → PA

D1₁₆   ASI_FL8_S   VA → PA   VA → PA

D2₁₆   ASI_FL16_P   VA → PA   VA → PA

D3₁₆   ASI_FL16_S   VA → PA   VA → PA

D4₁₆–D7₁₆   DAE_invalid_asi   DAE_invalid_asi

D8₁₆   ASI_FL8_PL   VA → PA   VA → PA

D9₁₆   ASI_FL8_SL   VA → PA   VA → PA

DA₁₆   ASI_FL16_PL   VA → PA   VA → PA

DB₁₆   ASI_FL16_SL   VA → PA   VA → PA

DC₁₆–DF₁₆

E0₁₆   ASI_BLK_COMMIT_PRIMARY   VA → PA   VA → PA

E1₁₆   ASI_BLK_COMMIT_SECONDARY   VA → PA   VA → PA

E2₁₆   ASI_TWINX_P, ASI_STBI_P   VA → PA   VA → PA

E3₁₆   ASI_TWINX_S, ASI_STBI_S   VA → PA   VA → PA

E4₁₆–E9₁₆

EA₁₆   ASI_TWINX_PL, ASI_STBI_PL   VA → PA   VA → PA

EB₁₆   ASI_TWINX_SL, ASI_STBI_SL   VA → PA   VA → PA

EC₁₆–EF₁₆

F0₁₆   ASI_BLK_PRIMARY   VA → PA   VA → PA

F1₁₆   ASI_BLK_SECONDARY   VA → PA   VA → PA

F2₁₆   ASI_STBI_MRU_PRIMARY   VA → PA   VA → PA

F3₁₆   ASI_STBI_MRU_SECONDARY   VA → PA   VA → PA

F4₁₆–F7₁₆

F8₁₆   ASI_BLK_PL   VA → PA   VA → PA

F9₁₆   ASI_BLK_SL   VA → PA   VA → PA

FA₁₆   ASI_STBI_MRU_PRIMARY_LITTLE   VA → PA   VA → PA

FB₁₆   ASI_STBI_MRU_SECONDARY_LITTLE   VA → PA   VA → PA

FC₁₆–FF₁₆



13.6 Compliance With the SPARC V9 Annex F

The SPARC M5 MMU complies completely with the SPARC V9 MMU Requirements described in Annex F of The SPARC Architecture Manual, Version 9. TABLE 13-10 shows how various protection modes can be achieved, if necessary, through the presence or absence of a translation in the I- or D-MMU.

13.7 MMU Internal Registers and ASI Operations

13.7.1 Accessing MMU RegistersAll internal MMU registers can be accessed directly by the virtual processor through ASIs defined bySPARC M5.

See Section 13.5 for details on the behavior of the MMU during all other SPARC M5 ASI accesses.

If the low-order three bits of the VA are nonzero in an LDXA/STXA to/from these registers, a mem_address_not_aligned trap occurs. Writes to read-only registers, reads of write-only registers, illegal ASI values, or an illegal VA for a given ASI may cause a DAE_invalid_asi trap.

TABLE 13-10 MMU Compliance With SPARC V9 Annex F Protection Mode

Condition                                              Resultant
TTE in D-MMU   TTE in I-MMU   Writable Attribute Bit   Protection Mode
Yes            No             0                        Read-only
No             Yes            Don’t Care               Execute-only
Yes            No             1                        Read/Write
Yes            Yes            0                        Read-only/Execute
Yes            Yes            1                        Read/Write/Execute
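The rows of TABLE 13-10 can be read as a small lookup. The following is a minimal sketch in Python, not part of the SPARC M5 specification; the function name and the use of None for the combination the table does not list (no TTE in either MMU) are assumptions made for illustration.

```python
def protection_mode(tte_in_dmmu, tte_in_immu, writable):
    """Return the resultant protection mode per TABLE 13-10.

    Hypothetical helper: 'writable' is the TTE writable attribute bit.
    Returns None when neither MMU holds a translation, a case the
    table does not list.
    """
    if tte_in_dmmu and tte_in_immu:
        return "Read/Write/Execute" if writable else "Read-only/Execute"
    if tte_in_dmmu:
        return "Read/Write" if writable else "Read-only"
    if tte_in_immu:
        # The writable bit is "Don't Care" for an I-MMU-only translation.
        return "Execute-only"
    return None
```

For example, protection_mode(True, False, 0) yields "Read-only", matching the first row of the table.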

Note STXA to an MMU register does not require any subsequent instructions such as a MEMBAR #Sync, FLUSH, DONE, or RETRY before the register effect will be visible to load / store / atomic accesses. SPARC M5 resolves all MMU register hazards via an automatic synchronization on all MMU register writes.

Caution SPARC M5 does not check for out-of-range virtual addresses during an STXA to any internal register; it simply sign-extends the virtual address based on VA{51}. Software must guarantee that the VA is within range.
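Because the hardware does not range-check the VA, software may want to validate an address before issuing the STXA. The sketch below, in Python, models the two rules stated in this section: sign extension from VA{51} and the requirement that the low-order three bits be zero. The function names are hypothetical, and treating bit 51 as the top of a 52-bit field is an assumption drawn only from the VA{51} wording above.

```python
MASK64 = (1 << 64) - 1
VA_BITS = 52  # assumption: VA{51} is the most significant implemented bit


def sign_extend_va(va):
    """Sign-extend a 64-bit value from bit 51, as the Caution describes."""
    low = va & ((1 << VA_BITS) - 1)
    if low & (1 << (VA_BITS - 1)):
        # Bit 51 is set: replicate it into bits 63:52.
        low |= MASK64 & ~((1 << VA_BITS) - 1)
    return low


def va_in_range(va):
    """True if sign extension leaves the address unchanged."""
    return sign_extend_va(va) == (va & MASK64)


def va_aligned_for_xa(va):
    """True if the low-order three bits are zero (no mem_address_not_aligned)."""
    return (va & 0x7) == 0
```

A caller could check both va_in_range(va) and va_aligned_for_xa(va) before an access to an internal register.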

TABLE 13-11 SPARC M5 MMU Internal Registers and ASI Operations

I-MMU ASI  D-MMU ASI  VA{63:0}  Access      Register or Operation Name
      21₁₆            8₁₆       Read/Write  Primary Context 0 register
—          21₁₆       10₁₆      Read/Write  Secondary Context 0 register
      21₁₆            108₁₆     Read/Write  Primary Context 1 register
—          21₁₆       110₁₆     Read/Write  Secondary Context 1 register


13.7.2 Context Registers

SPARC M5 supports a pair of primary and a pair of secondary context registers per strand, which are shared by the I- and D-MMUs. Primary Context 0 and Primary Context 1 are the primary context registers, and a TLB entry for a translating primary ASI can match the context field with either Primary Context 0 or Primary Context 1 to produce a TLB hit. Secondary Context 0 and Secondary Context 1 are the secondary context registers, and a TLB entry for a translating secondary ASI can match the context field with either Secondary Context 0 or Secondary Context 1 to produce a TLB hit.
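The matching rule, together with the behavior described in the Compatibility Note of this section (a write to Context 0 also updates Context 1), can be modeled in a few lines. This is an illustrative Python sketch only; the class and method names are hypothetical, and the 16-bit mask reflects the context field occupying bits 15:0 of each register.

```python
class ContextRegisters:
    """Toy model of one strand's primary and secondary context registers."""

    def __init__(self):
        self.primary = [0, 0]     # Primary Context 0, Primary Context 1
        self.secondary = [0, 0]   # Secondary Context 0, Secondary Context 1

    @staticmethod
    def _write(regs, index, value):
        value &= 0xFFFF           # context field is bits 15:0
        regs[index] = value
        if index == 0:
            # Compatibility behavior: writing Context 0 also updates Context 1.
            regs[1] = value

    def write_primary(self, index, value):
        self._write(self.primary, index, value)

    def write_secondary(self, index, value):
        self._write(self.secondary, index, value)

    def primary_hit(self, tte_context):
        """Context comparison for a translating primary ASI."""
        return tte_context in self.primary

    def secondary_hit(self, tte_context):
        """Context comparison for a translating secondary ASI."""
        return tte_context in self.secondary
```

Software that wants two distinct primary contexts would therefore write Context 0 first and Context 1 second, since the reverse order leaves both registers holding the Context 0 value.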

The Primary Context 0 and Primary Context 1 registers are defined as shown in FIGURE 13-2, where pcontext is the context value for the primary address space.

63        16 15        0
—            pcontext

FIGURE 13-2 Primary Context 0/1 Registers, ASI 21₁₆, VA 8₁₆ and ASI 21₁₆, VA 108₁₆

The Secondary Context 0 and Secondary Context 1 registers are defined in FIGURE 13-3, where scontext is the context value for the secondary address space.

63        16 15        0
—            scontext

FIGURE 13-3 Secondary Context 0/1 Registers, ASI 21₁₆, VA 10₁₆ and ASI 21₁₆, VA 110₁₆

The contents of the Nucleus Context register are hardwired to the value zero:

63                                                                             0
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

FIGURE 13-4 Nucleus Context Register

Compatibility Note To maintain backward compatibility with software designed for a single primary and single secondary context register, writes to Primary (Secondary) Context 0 Register also update Primary (Secondary) Context 1 Register.


APPENDIX A

Programming Guidelines

A.1 Multithreading

In SPARC M5, each physical core contains eight strands. Each strand has a full set of architected state registers and appears to software as a complete processor¹. In general, the eight strands share the execution pipeline, including the instruction, data, and L2 caches, branch predictor, out-of-order scheduling, execution pipelines, and retirement mechanisms. The pipeline is both horizontally and vertically threaded. It is vertically threaded since instructions from different strands can be in adjacent pipeline stages. It is horizontally threaded where parallelism allows. For example, each cycle the Pick unit may pick one instruction from one thread, and another instruction from a different thread, to be issued to independent execution units. SPARC M5 utilizes advanced branch prediction, dual instruction issue, out-of-order execution with up to 128 instructions in flight using a reorder buffer, hardware prefetching of instruction and data cache misses, and seamless hardware thread switching to provide high per-thread performance as well as high throughput. The pipeline is partitioned into several major subsections: instruction fetch, select/decode/rename, pick/issue/execute, and commit, each of which is mostly independent of the others.

A.1.1 Instruction fetch

Each cycle an arbiter chooses one strand for instruction fetching. The least-recently-fetched strand among the strands which are ready for fetching is the one chosen. A strand may not be ready for fetching due to instruction cache misses, instruction buffer full conditions, or other reasons. Once selected for fetch, up to four instructions may be fetched from the instruction cache and placed in per-strand instruction buffers. Instruction fetching occupies the first few stages of the pipeline. Instruction fetching is decoupled from the rest of the pipeline by the Select stage.
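The least-recently-fetched policy can be illustrated with a small software model. This Python sketch is not a description of the actual arbiter hardware; the class name, the initial ordering of strands, and the use of a set for the ready strands are assumptions made for the example.

```python
class LeastRecentArbiter:
    """Pick the ready strand that was chosen least recently."""

    def __init__(self, n_strands=8):
        # Front of the list is the least recently chosen strand.
        # The initial order is arbitrary (an assumption of this sketch).
        self.order = list(range(n_strands))

    def choose(self, ready):
        """Return the chosen strand ID, or None if no strand is ready."""
        for strand in self.order:
            if strand in ready:
                # Move the winner to the back: it is now most recently chosen.
                self.order.remove(strand)
                self.order.append(strand)
                return strand
        return None
```

A strand that stalls (for example, on an instruction cache miss) simply drops out of the ready set and keeps its position, so it is favored once it becomes ready again. The same policy shape applies to Select and Commit, which the following sections describe as choosing the least-recently-selected and least-recently-committed strand.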

A.1.2 Select/Decode/Rename

In the same fashion that instruction fetch chooses a strand for fetching, Select chooses a strand for decoding, renaming, and transfer to the Pick unit. Each cycle, in parallel with and independent of instruction fetch, Select determines which strand, among the ready strands, is the least-recently selected. A strand may not be ready due to per-strand wait conditions, such as an empty instruction buffer or a post-synchronizing² instruction pending, or due to pipeline-wide resource constraints, such as a lack of reorder buffer entries.

Select then reads up to 2 instructions per cycle from that strand’s instruction buffers, and decodes and renames the instructions. As it decodes the instructions it identifies any intra-strand dependencies upon prior instructions, and enforces these dependencies until the instructions are sent to the Pick unit and written into the pick queue. Decode also assigns instructions to “slots”. There are 2 primary slots. Slot 0 is reserved for integer and load/store instructions. Slot 1 is reserved for integer, floating-point, graphics, cryptographic, and control transfer instructions. There is a third auxiliary slot, slot 2, which is reserved for store data operations.

1. Certain state registers are shared across strands to conserve hardware resources. These shared registers will (eventually) be listed in this Appendix.

2. A post-synchronizing instruction stalls instruction issue for the strand after issuing the post-synchronizing instruction until the instruction commits. Instructions which are post-synchronizing are listed in TABLE A-1, SPARC M5 Instruction Latencies, on page 97.

A.1.3 Pick/Issue/Execute

Pick selects up to 2 instructions per cycle (an additional store data operation may also be picked) without regard to strand ID from a 36-entry out-of-order scheduler termed the pick queue. Instructions are written with a relative age in mind, so the pick queue picks the oldest ready instruction within a slot. An instruction is ready when all of its source data is available. Only one instruction can be picked for each slot each cycle. There are never any inter-strand instruction dependencies. As Pick issues instructions, pick queue entries are reclaimed, and made available for use by subsequent instructions coming from Select/decode/rename.
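The "oldest ready instruction within a slot" rule can be sketched as follows. This Python model is purely illustrative: the PickEntry fields, the use of a sequence number as the age tag, and the list-based queue are hypothetical and say nothing about how the 36-entry pick queue is actually built.

```python
from dataclasses import dataclass


@dataclass
class PickEntry:
    seq: int      # hypothetical age tag: a smaller value means an older instruction
    slot: int     # 0 or 1 (primary slots) or 2 (store data), assigned at decode
    strand: int   # strand ID; ignored by the pick decision
    ready: bool   # True once all source data is available


def pick(queue):
    """Pick at most one instruction per slot: the oldest ready entry in each."""
    chosen = {}
    for entry in queue:
        if not entry.ready:
            continue
        best = chosen.get(entry.slot)
        if best is None or entry.seq < best.seq:
            chosen[entry.slot] = entry
    # Issued entries are reclaimed so later instructions can use them.
    for entry in chosen.values():
        queue.remove(entry)
    return chosen
```

Note that an older entry that is not ready does not block a younger ready entry in the same slot, which is what makes the scheduler out-of-order.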

As Pick issues instructions to the execution units, the instructions execute in one of several functional units. There are 2 integer units, a floating-point and graphics unit, and a load/store unit. Each of these units has independent pipelines and operates in parallel with other execution units. When instructions finish execution, they report their completion status to the Commit unit.

A.1.4 Commit

Commit utilizes a 128-entry reorder buffer to hold completion status and other per-instruction information. Instructions commit once their completion status is available. Instructions which cause an exception complete, but do not commit. Instead, they trap, and the thread begins fetching instructions from the trap handler. Similarly, if a branch misprediction occurs, instruction fetching resumes from the correct path once the branch predictor has been updated, and execution resumes once all instructions prior to the mispredicted branch commit.

Commit is threaded and each cycle attempts to commit instructions from the least-recently-committed thread among the threads which are ready to commit.

A.1.5 Context Switching Between Strands

Since context switching is built into the SPARC M5 pipeline (via the instruction fetch, select/decode/rename, pick/issue/execute, and commit blocks), strands can be switched each cycle with no pipeline stall penalty.

A.1.6 Synchronization

Certain instructions require the pipeline to synchronize. One type of synchronization, post-synchronizing or post-sync’ing, puts the strand in a wait state at Select. The strand remains in a wait state, and subsequent instructions are not selected, decoded, or renamed until the post-sync clears. The post-sync clears when the post-sync’ing instruction commits.


A.2 Optimizing for Single-Threaded Performance or Throughput

Section 1.3.1.1, Single-threaded and multi-threaded performance, on page 31 describes some aspects of optimizing for single-threaded and/or multi-threaded performance.

A.3 Instruction Latency

TABLE A-1 lists the minimum single-strand instruction latencies for SPARC M5. When multiple strands are executing, some or much of the additional latency for multicycle instructions will be overlapped with execution of the additional strands.

A pre-sync’ing instruction waits at Pick for all prior instructions from the strand to commit before being picked; therefore these instructions have a variable latency, whose minimum is listed in TABLE A-1. A post-sync’ing instruction causes a flush after the instruction commits. Loads have a 5-cycle load-use delay (4 cycles need to be filled, but out-of-order execution covers much of this latency in many cases). Branch instructions have a 2-cycle latency in the branch unit but are fully pipelined.

TABLE A-1 SPARC M5 Instruction Latencies (1 of 9)

Opcode Description Latency Post-sync Notes

ADD (ADDcc) Add (and modify condition codes) 1

ADDC (ADDCcc) Add with carry (and modify condition codes) 1

ADDXC (ADDXCcc) Add extended with carry (and modify condition codes) 1

AES_DROUND01 AES decrypt round, columns 0 & 1 3

AES_DROUND23 AES decrypt round, columns 2 & 3 3

AES_DROUND01_LAST AES decrypt last round, columns 0 & 1 3

AES_DROUND23_LAST AES decrypt last round, columns 2 & 3 3

AES_EROUND01 AES encrypt round, columns 0 & 1 3

AES_EROUND23 AES encrypt round, columns 2 & 3 3

AES_EROUND01_LAST AES encrypt last round, columns 0 & 1 3

AES_EROUND23_LAST AES encrypt last round, columns 2 & 3 3

AES_KEXPAND0 AES key expansion without round constant 3

AES_KEXPAND1 AES key expansion with round constant 3

AES_KEXPAND2 AES key expansion without SBOX 3

ALIGNADDRESS Calculate address for misaligned data access 12

ALIGNADDRESSL Calculate address for misaligned data access (little-endian) 12

ALLCLEAN Mark all windows as clean 1 breaks decode group


AND (ANDcc) Logical and (and modify condition codes) 1

ANDN (ANDNcc) Logical and not (and modify condition codes) 1

ARRAY{8,16,32} 3-D address to blocked byte address conversion 12

Bicc Branch on integer condition codes 2

BMASK Write the GSR.mask field 12

BPcc Branch on integer condition codes with prediction 2

BPr Branch on contents of integer register with prediction 2

BSHUFFLE Permute bytes as specified by the GSR.mask field 11

CALL Call and link 2

CAMELLIA_F Camellia F operation 3

CAMELLIA_FL Camellia FL operation 3

CAMELLIA_FLI Camellia FLI operation 3

CASA Compare and swap word in alternate space 20-30 Done in L2 cache

CASXA Compare and swap doubleword in alternate space 20-30 Done in L2 cache

CBcond Compare-and-Branch instructions 2

CMASK{8,16,32} Create GSR.mask from SIMD operation result 12

DES_IP DES initial permutation 3

DES_IIP DES inverse initial permutation 3

DES_KEXPAND DES key expansion 3

DES_ROUND DES round 3

DONE Return from trap 23 Causes flush and redirect to TNPC (23 cycle bubble)

EDGE{8,16,32}{L}{N} Edge boundary processing {little-endian} {non-condition-code altering} 12

FABS(s,d) Floating-point absolute value 11

FADD(s,d) Floating-point add 11

FALIGNDATA Perform data alignment for misaligned data 11

FANDNOT1{s} Negated src1 and src2 (single precision) 11

FANDNOT2{s} src1 and negated src2 (single precision) 11

FAND{s} Logical and (single precision) 11

FBPfcc Branch on floating-point condition codes with prediction 1

FBfcc Branch on floating-point condition codes 1

FCHKSM16 16-bit partitioned checksum 11

FCMP(s,d) Floating-point compare 11

FCMPE(s,d) Floating-point compare (exception if unordered) 11

FCMPEQ{16,32} Four 16-bit / two 32-bit compare: set integer dest if src1 = src2 12

FCMPGT{16,32} Four 16-bit / two 32-bit compare: set integer dest if src1 > src2 12

FCMPLE{16,32} Four 16-bit / two 32-bit compare: set integer dest if src1 ≤ src2 12




FCMPNE{16,32} Four 16-bit / two 32-bit compare: set integer dest if src1 ≠ src2 12

FDIV(s,d) Floating-point divide 24 SP, 37 DP

FEXPAND Four 8-bit to 16-bit expand 11

FHADD{s,d} Floating-point add and halve 11

FHSUB{s,d} Floating-point subtract and halve 11

FiTO(s,d) Convert integer to floating-point 11

FLUSH Flush instruction memory 27 Y Flushes pipeline, 27 cycle bubble minimum

FLUSHW Flush register windows 1 breaks decode group

FLCMP{s,d} Lexicographic compare 11

FMADD{s,d} Floating-point multiply-add single/double (fused) 11

FMEAN16 16-bit partitioned average 11

FMOV(s,d) Floating-point move 11

FMOV(s,d)cc Move floating-point register if condition is satisfied 11

FMOV(s,d)R Move floating-point register if integer register contents satisfy condition 11 Cracked into 2 ops, breaks decode group

FMSUB{s,d} Floating-point multiply-subtract single/double (fused) 11

FMUL(s,d) Floating-point multiply 11

FMUL8SUx16 Signed upper 8- x 16-bit partitioned product of corresponding components 11

FMUL8ULx16 Unsigned lower 8- x 16-bit partitioned product of corresponding components 11

FMUL8x16 8- x 16-bit partitioned product of corresponding components 11

FMUL8x16AL Signed lower 8- x 16-bit lower α partitioned product of 4 components 11

FMUL8x16AU Signed upper 8- x 16-bit lower α partitioned product of 4 components 11

FMULD8SUx16 Signed upper 8- x 16-bit multiply → 32-bit partitioned product of components 11

FMULD8ULx16 Unsigned lower 8- x 16-bit multiply → 32-bit partitioned product of components 11

FNADD(s,d) Floating-point add and negate 11

FNAND{s} Logical nand (single precision) 11

FNEG(s,d) Floating-point negate 11

FNHADD{s,d} Floating-point add and halve, then negate 11

FNMADD{s,d} Floating-point negative multiply-add single/double (fused) 11

FNMSUB{s,d} Floating-point negative multiply-subtract single/double (fused) 11

FNMUL{s,d} Floating-point multiply and negate 11




FNsMULd Floating-point multiply and negate 11

FNOR{s} Logical nor (single precision) 11

FNOT1{s} Negate (1’s complement) src1 (single precision) 11

FNOT2{s} Negate (1’s complement) src2 (single precision) 11

FONE{s} One fill (single precision) 11

FORNOT1{s} Negated src1 or src2 (single precision) 11

FORNOT2{s} src1 or negated src2 (single precision) 11

FOR{s} Logical or (single precision) 11

FPACKFIX Two 32-bit to 16-bit fixed pack 11

FPACK{16,32} Four 16-bit/two 32-bit pixel pack 11

FPADD{16,32}{s} Four 16-bit/two 32-bit partitioned add (single precision) 11

FPADD64 Fixed-point partitioned add 11

FPADDS{16,32}{s} Fixed-point partitioned add 11

FPMADDX Unsigned integer multiply-add 11

FPMADDXHI Unsigned integer multiply-add, return high-order 64 bits of result 11

FPMERGE Two 32-bit to 64-bit fixed merge 11

FPSUB{16,32}{s} Four 16-bit/two 32-bit partitioned subtract (single precision) 11

FPSUB64 Fixed-point partitioned subtract, 64-bit 11

FPSUBS{16,32}{s} Fixed-point partitioned subtract 11

FSLL{16,32} 16- or 32-bit partitioned shift, left (old mnemonic FSHL) 11

FSLAS{16,32} 16- or 32-bit partitioned shift, left or right (old mnemonic FSHLAS) 11

FSRA{16,32} 16- or 32-bit partitioned shift, left or right (old mnemonic FSHRA) 11

FSRL{16,32} 16- or 32-bit partitioned shift, left or right (old mnemonic FSHRL) 11

FsMULd Floating-point multiply single to double 11

FSQRT(s,d) Floating-point square root 24 SP, 37 DP

FSRC1 Copy src1 11

FSRC1{s} Copy src1 (single precision) 11

FSRC2 Copy src2 1

FSRC2{s} Copy src2 (single precision) 11

F(s,d)TO(s,d) Convert between floating-point formats 11

F(s,d)TOi Convert floating point to integer 11

F(s,d)TOx Convert floating point to 64-bit integer 11

FSUB(s,d) Floating-point subtract 11

FUCMP{GT,LE,NE,EQ}8 Compare 8-bit unsigned fixed-point values 12




FXNOR{s} Logical xnor (single precision) 11

FXOR{s} Logical xor (single precision) 11

FxTO(s,d) Convert 64-bit integer to floating-point 11

FZERO{s} Zero fill (single precision) 11

ILLTRAP Illegal instruction 23

INVALW Mark all windows as CANSAVE 1 breaks decode group

JMPL Jump and link 2

KASUMI_FI_XOR Kasumi FI followed by XOR 3

KASUMI_FI_FI Kasumi FI followed by FI 3

KASUMI_FL_XOR Kasumi FL followed by XOR 3

LDBLOCKF 64-byte block load 26 Cracked into 8 helper loads that reference L2

LDD Load doubleword 20

LDDA Load doubleword from alternate space 20 Latency can be larger depending on ASI value

LDDF Load double floating-point 5

LDDFA Load double floating-point from alternate space 5

LDF Load floating-point 5

LDFA Load floating-point from alternate space 5

LDFSR Load floating-point state register lower variable Y

LDSB Load signed byte 5

LDSBA Load signed byte from alternate space 5

LDSH Load signed halfword 5

LDSHA Load signed halfword from alternate space 5

LDSTUB Load-store unsigned byte 20-30 Done in L2 cache

LDSTUBA Load-store unsigned byte in alternate space 20-30 Done in L2 cache

LDSW Load signed word 5

LDSWA Load signed word from alternate space 5

LDTW Load twin word 20 breaks decode group

LDTWA Load twin word from alternate space 20 breaks decode group

LDTX Load twin extended word 20 breaks decode group

LDTXA Load twin extended word from alternate space 20 breaks decode group

LDUB Load unsigned byte 5

LDUBA Load unsigned byte from alternate space 5




LDUH Load unsigned halfword 5

LDUHA Load unsigned halfword from alternate space 5

LDUW Load unsigned word 5

LDUWA Load unsigned word from alternate space 5

LDX Load extended 5

LDXA Load extended from alternate space variable if from nontranslating ASI, else 5

LDXFSR Load extended floating-point state register variable Y

LDXEFSR Load extended floating-point state register variable Y

LZD Leading zero detect on 64-bit integer register 12

MD5 MD5 hash 192 Y

MEMBAR Memory barrier variable MEMBAR #Sync is post-sync’ing; other MEMBAR forms are not

MOVcc Move integer register if condition is satisfied 1

MOVr Move integer register on contents of integer register 1 breaks decode group

MOVdTOx Move floating-point register to integer register 1

MOVsTO{u,s}w Move floating-point register to integer register 12

MOVxTOd Move integer register to floating-point register 1

MOVwTOs Move integer register to floating-point register 12

MPMUL Multiple-precision multiplication variable Y pre-sync

MONTMUL Montgomery multiplication variable Y pre-sync

MONTSQR Montgomery squaring variable Y pre-sync

MULScc Multiply step (and modify condition codes) 12 pre-sync

MULX Multiply 64-bit integers 12

NOP No operation 1

NORMALW Mark other windows as restorable 1 breaks decode group

OR (ORcc) Inclusive-or (and modify condition codes) 1

ORN (ORNcc) Inclusive-or not (and modify condition codes) 1

OTHERW Mark restorable windows as other 1 breaks decode group

PDIST Distance between eight 8-bit components 11

PDISTN Pixel component distance 12

POPC Population count 12




PREFETCH Prefetch data 1

PREFETCHA Prefetch data from alternate space 1

RDASI Read ASI register variable Y

RDASR Read ancillary state register variable Y

RDCCR Read condition codes register variable Y

RDCFR Read compatibility feature register variable

RDFPRS Read floating-point registers state register variable Y

RDPC Read program counter (PC) 2

RDPR Read privileged register variable Y

RDTICK Read TICK register variable Y

RDY Read Y register variable Y

RESTORE Restore caller’s window 1 breaks decode group

RESTORED Window has been restored 1 breaks decode group

RETRY Return from trap and retry 23 Causes flush and redirect to TPC (23 cycle bubble)

RETURN Return 1 breaks decode group

SAVE Save caller’s window 1 breaks decode group

SAVED Window has been saved 1 breaks decode group

SDIVcc 32-bit signed integer divide (and modify condition codes) 41-60 pre-sync

SDIVX{i} 64-bit signed integer divide 26-44

SETHI Set high 22 bits of low word of integer register 1

SHA1 SHA-1 hash 226 Y pre-sync



SHUTDOWN (deprecated) 1

SIAM Set interval arithmetic mode 1

SLL Shift left logical 1

SLLX Shift left logical, extended 1

SMUL (SMULcc) Signed integer multiply (and modify condition codes) 12

SRA Shift right arithmetic 1

SRAX Shift right arithmetic, extended 1

SRL Shift right logical 1

SRLX Shift right logical, extended 1

STB Store byte 1

STBA Store byte into alternate space 1




STBAR Store barrier variable

STBLOCKF 64-byte block store 8

STD Store doubleword 1

STDA Store doubleword into alternate space 1

STDF Store double floating-point 1

STDFA Store double floating-point into alternate space 1

STF Store floating-point 1

STFA Store floating-point into alternate space 1

STFSR Store floating-point state register variable Y

STH Store halfword 1

STHA Store halfword into alternate space 1

STPARTIALF Eight 8-bit/4 16-bit/2 32-bit partial stores 1

STW Store word 1

STWA Store word into alternate space 1

STX Store extended 1

STXA Store extended into alternate space variable if from nontranslating ASI, else 1 depends upon ASI

STXFSR Store extended floating-point state register variable Y pre-sync

SUB (SUBcc) Subtract (and modify condition codes) 1

SUBC (SUBCcc) Subtract with carry (and modify condition codes) 1

SWAP Swap integer register with memory 20-30 Done in L2 cache

SWAPA Swap integer register with memory in alternate space 20-30 Done in L2 cache

TADDcc (TADDccTV) Tagged add and modify condition codes (trap on overflow) 1

TSUBcc (TSUBccTV) Tagged subtract and modify condition codes (trap on overflow) 1

Tcc Trap on integer condition codes (with 8-bit sw_trap_number, if bit 7 is set, trap to hyperprivileged) 1 if no trap or 23 if trap taken

UDIVcc Unsigned integer divide (and modify condition codes) 41-60 pre-sync

UDIVX{i} 64-bit unsigned integer divide 26-44

UMUL (UMULcc) Unsigned integer multiply (and modify condition codes) 12

UMULXHI Unsigned 64 x 64 multiply, returning upper 64 product bits 12

WRASI Write ASI register variable Y

WRASR Write ancillary state register variable Y

WRCCR Write condition codes register variable Y

WRFPRS Write floating-point registers state register variable Y




WRPAUSE Pause instruction variable Y

WRPR Write privileged register variable Y

WRY Write Y register variable Y

XMULX{HI} XOR multiply 12

XNOR (XNORcc) Exclusive-nor (and modify condition codes) 1

XOR (XORcc) Exclusive-or (and modify condition codes) 1

A.4 Coding PAUSE loops

The WRPAUSE instruction can be used to place a strand in a paused state to temporarily suspend instruction execution on that strand. This is useful while implementing exponential backoff algorithms, for example.

In SPARC M5, care needs to be taken when coding WRPAUSE in loops. Dynamic branch instruction density¹ should not exceed 10% for a pause loop. That is, in the dynamic instruction count executed during the pause loop, a maximum of 10% of the instructions should be branches (taken or not taken). The loop may be padded with NOPs to achieve this density. This is a limitation of SPARC M5 processors, which will be addressed in future SPARC processors.

1. “Branch instruction”, in this context, refers to any control transfer instruction (CTI) except DONE, RETRY, or Tcc.
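The 10% branch-density guideline for pause loops translates into a simple padding calculation. The Python sketch below is illustrative only; the function name and parameters are hypothetical, and it counts instructions per loop iteration on the assumption that every iteration executes the same instruction mix.

```python
def nops_needed(branches, other_instructions, max_percent=10):
    """Return the number of NOPs to add per iteration so that branches
    make up at most max_percent of the dynamic instruction count.

    'branches' counts control transfer instructions other than DONE,
    RETRY, and Tcc; 'other_instructions' counts everything else already
    in the loop body (including WRPAUSE itself).
    """
    # Smallest total instruction count with branches/total <= max_percent/100,
    # computed with integer ceiling division to avoid floating-point error.
    min_total = -(-branches * 100 // max_percent)
    return max(0, min_total - branches - other_instructions)
```

For instance, a loop body consisting of one WRPAUSE and one backward branch would need eight NOPs under this calculation, giving one branch in ten instructions.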




APPENDIX B

IEEE 754 Floating-Point Support

SPARC M5 conforms to Oracle SPARC Architecture 2011 and the corresponding IEEE Std 754-1985 Requirements chapter.

B.1 Special Operand and Result Handling

The SPARC M5 FGU provides full hardware support for subnormal operands and results for all instructions. SPARC M5 never generates an unfinished_FPop trap type. SPARC M5 does not implement a non-standard floating-point mode. The NS bit of the FSR is always read as 0, and writes to it are ignored.

Note SPARC M5 detects tininess before rounding.


APPENDIX C

Differences Between SPARC T4 and SPARC M5

This chapter describes the differences between the earlier SPARC T4 and SPARC M5. A summary of the differences is provided in the table below.

C.1 Architectural and Microarchitectural Differences

SPARC M5 reuses the SPARC core from SPARC T4. The unified L3 cache is shared among six cores (vs. eight in SPARC T4), and all the SPARC M5 SOC components are either redesigned or modified from SPARC T4.

SPARC M5 is capable of supporting up to 8 processors in a glueless fashion and provides scalability ports for scaling beyond 8 processors.

For details, refer to the following chapters:

■ For details of overall architectural and microarchitectural differences, see Chapter 1, SPARC M5 Basics.

■ For details of all supported system configurations, see Chapter 34, System Configurations.

■ For details of coherency and ordering protocol differences, see Chapter 30, Coherency and Ordering Unit (COU).

Area                                vs SPARC T4   Description
Architecture and Microarchitecture  Different     Section C.1
Data Format                         Same
Registers                           Same
Instruction Format                  Same
Instruction Definitions             Same
Traps                               Same
Interrupt Handling                  Different     Section C.2
Memory Models                       Same
Address Spaces & ASIs
Performance Measurement
Crypto                              Same
MMU                                 Same


C.2 Interrupt Handling Differences

Handling of I/O mondo interrupts is different in SPARC M5.

For details, refer to the following chapters:

■ Interrupt Handling Chapter

■ PCIE Chapter

■ Non-cacheable Unit (NCU) Chapter

C.3 Address Spaces and ASIs Differences

C.3.1 ASIs

Addressing of all ASIs in the SPARC core (including L2) does not change from SPARC T4 to SPARC M5.

See Address Spaces and ASIs Chapter for details.

C.3.2 CSRs

Addressing of all CSRs in the SPARC core (including L2) does not change from SPARC T4 to SPARC M5.

See Address Spaces and ASIs Chapter for details.


APPENDIX D

Cache Coherency and Ordering

D.1 Cache and Memory Interactions

This appendix describes various interactions between the caches and memory, and the management processes that an operating system must perform to maintain data integrity in these cases. In particular, it discusses the following:

■ Invalidation of one or more cache entries—when and how to do it

■ Differences between cacheable and noncacheable accesses

■ Ordering and synchronization of memory accesses

■ Accesses to addresses that cause side effects (I/O accesses)

■ Nonfaulting loads

■ Cache sizes, associativity, replacement policy, etc.

D.2 Coherency Overview

Please refer to “VF Link ERS, Coherency Chapter”.


D.3 Cache Flushing

Data in the level-1 (read-only or writethrough) caches can be flushed by invalidating the entry in the cache (in a way that also leaves the L2 directory in a consistent state). Modified data in the level-2 (writeback) cache must be written back to memory when flushed.

Cache flushing is required in the following cases:


■ I-cache: Flush is needed before executing code that is modified by a local store instruction. This is done with the FLUSH instruction, which just forces previous stores to complete to all affected caches. Flushing the I-cache with ASI accesses (Section 23.5, L1 I-Cache Diagnostic Access, on page 1489) also works, because the L2 directory correctly handles the cases where the directory thinks the line is in the L1, but the L1 doesn’t.

■ D-cache: Flush is needed when a physical page is changed from (physically) cacheable to (physically) noncacheable. This is done with a displacement flush (Displacement Flushing, below), or with ASI accesses (see Section 20.10, L1 I-Cache Diagnostic Access, on page 90), which work for reasons similar to those for the I-cache.

■ L2 cache: Flush is needed for stable storage. Examples of stable storage include battery-backed memory and transaction logs. The recommended way to perform this is by using the PrefetchICE instruction (see Section 20.19, L3 Index and Bank Hashing, on page 109). Alternatively, this can be done by a displacement flush (see the next section). Flushing the L2 cache flushes the corresponding blocks from the I- and D-caches, because SPARC M5 maintains inclusion between the L2 and L1 caches.

■ Errors: Flush is needed for error processing. Examples include (1) forcing UE data from a cache to memory, in order to convert it to NotData, or (2) using flushes to force memory (not cache) reads and writes, to diagnose a memory error, or (3) writing a line of good data and flushing it to memory, to overwrite a memory soft error.

D.3.1 Displacement Flushing

Cache flushing of the L2 cache or the D-cache can be accomplished by a displacement flush. This is done by placing the cache in direct-map mode, and reading a range of read-only addresses that map to the corresponding cache line being flushed, forcing out modified entries in the local cache. Care must be taken to ensure that the range of read-only addresses is mapped in the MMU before starting a displacement flush; otherwise, the TLB miss handler may put new data into the caches. In addition, the range of addresses used to force lines out of the cache must not be present in the cache when starting the displacement flush. (If any of the displacing lines are present before starting the displacement flush, fetching the already present line will not cause the proper way in the direct-mapped mode L2 to be loaded; instead, the already present line will stay at its current location in the cache.)

Architectural Note – Does direct-mapped mode still work if any L2 lines are mapped out? If so, displacement flushing may still be possible, but involves clearing all U bits, then forcing at least 6 MB of misses.

D.3.2 Memory Accesses and Cacheability

Note Diagnostic accesses to the L2 cache can be used to invalidate a line, but they are not an alternative to PrefetchICE or displacement flushing. L2 diagnostic accesses do not cause invalidation of L1 lines (breaking L1 inclusion) and modified data in the L2 cache will not be written back to memory using these ASI accesses. See Section 20.22, L2 Cache Diagnostic Access, on page 111.

Note Atomic load-store instructions are treated as both a load and a store; they can be performed only in cacheable address spaces.

In SPARC M5, all memory accesses are cached in the L2 cache (as long as the L2 cache is enabled). The cp bit in the TTE corresponding to the access controls whether the memory access will be cached in the primary caches (if cp = 1, the access is cached in the primary caches; if cp = 0, the access is not cached in the primary caches). Atomic operations are always performed at the L2 cache.

D.3.3 Coherence Domains

Two types of memory operations are supported in SPARC M5: cacheable and noncacheable accesses, as indicated by the page translation. Cacheable accesses are inside the coherence domain; noncacheable accesses are outside the coherence domain.

SPARC V9 does not specify memory ordering between cacheable and noncacheable accesses. SPARC M5 maintains TSO ordering, regardless of the cacheability of the accesses, relative to other accesses by processors.

Programming Note – Ordering of processor accesses relative to DMA accesses roughly follows PCI ordering rules for PCI devices.

See The SPARC Architecture Manual, Version 9 for more information about the SPARC V9 memory models.

On SPARC M5, a MEMBAR #Lookaside is effectively a NOP and is not needed for forcing order of stores vs. loads to noncacheable addresses.

D.3.3.1 Cacheable Accesses

Accesses that fall within the coherence domain are called cacheable accesses. They are implemented in SPARC M5 with the following properties:

■ Data resides in real memory locations.

■ They observe the supported cache coherence protocol.

■ The unit of coherence is 64 bytes at the system level (coherence between the virtual processors and I/O), enforced by the L2 cache.

■ The unit of coherence for the primary caches (coherence between multiple virtual processors) is the primary cache line size (16 bytes for the data cache, 32 bytes for the instruction cache), enforced by the L2 cache directories.

D.3.3.2 Noncacheable and Side-Effect Accesses

Accesses that are outside the coherence domain are called noncacheable accesses. Accesses of some of these memory (or memory-mapped) locations may result in side effects. Noncacheable accesses are implemented in SPARC M5 with the following properties:

■ Data may or may not reside in real memory locations.

■ Accesses may result in program-visible side effects; for example, memory-mapped I/O control registers in a UART may change state when read.

■ Accesses may not observe the supported cache coherence protocol.

■ The smallest unit in each transaction is a single byte.

Noncacheable accesses are all strongly ordered with respect to other noncacheable accesses (regardless of the e bit). Speculative loads with the e bit set cause a DAE_so_page trap.

Note The side-effect attribute does not imply noncacheability.


D.3.3.3 Global Visibility and Memory Ordering

To ensure the correct ordering between the cacheable and noncacheable domains, explicit memory synchronization is needed in the form of MEMBARs or atomic instructions. CODE EXAMPLE D-1 illustrates the issues involved in mixing cacheable and noncacheable accesses.

CODE EXAMPLE D-1 Memory Ordering and MEMBAR Examples

Assume that all accesses go to non-side-effect memory locations.

Process A:
While (1)
{
     Store D1: data produced
  1  MEMBAR #StoreStore (needed in PSO, RMO)
     Store F1: set flag
     While F1 is set (spin on flag)
         Load F1
  2  MEMBAR #LoadLoad | #LoadStore (needed in RMO)
     Load D2
}

Process B:
While (1)
{
     While F1 is cleared (spin on flag)
         Load F1
  2  MEMBAR #LoadLoad | #LoadStore (needed in RMO)
     Load D1
     Store D2
  1  MEMBAR #StoreStore (needed in PSO, RMO)
     Store F1: clear flag
}

Due to load and store buffers implemented in SPARC M5, CODE EXAMPLE D-1 may not work for RMO accesses without the MEMBARs shown in the program segment.

Under TSO, loads and stores (except block stores) cannot pass earlier loads, and stores cannot pass earlier stores; therefore, no MEMBAR is needed.

Under RMO, there is no implicit ordering between memory accesses; therefore, the MEMBARs at both #1 and #2 are needed.
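The per-model requirements annotated in CODE EXAMPLE D-1 can be tabulated mechanically. The following is a minimal sketch, not part of the specification: the table `REQUIRED` and the function `process_a` are hypothetical names, and the entries come only from the "needed in PSO, RMO" and "needed in RMO" annotations and the TSO statement above.

```python
# MEMBARs needed at points 1 and 2 of CODE EXAMPLE D-1, per memory model.
# None means no MEMBAR is required at that point.
REQUIRED = {
    "TSO": {1: None, 2: None},
    "PSO": {1: "#StoreStore", 2: None},
    "RMO": {1: "#StoreStore", 2: "#LoadLoad | #LoadStore"},
}


def process_a(model):
    """Return Process A's operation sequence for the given memory model."""
    need = REQUIRED[model]
    seq = ["Store D1"]
    if need[1]:
        seq.append("MEMBAR " + need[1])
    seq += ["Store F1", "Load F1 (spin while set)"]
    if need[2]:
        seq.append("MEMBAR " + need[2])
    seq.append("Load D2")
    return seq
```

For example, `process_a("TSO")` contains no MEMBAR at all, while `process_a("RMO")` contains both.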

Note A MEMBAR #MemIssue or MEMBAR #Sync is needed if ordering of cacheable accesses following noncacheable accesses must be maintained for RMO cacheable accesses.


D.3.4 Memory Synchronization: MEMBAR and FLUSH

The MEMBAR (STBAR in SPARC V8) and FLUSH instructions provide for explicit control of memory ordering in program execution. MEMBAR has several variations; their implementations in SPARC M5 are described below. See the references to “Memory Barrier,” “The MEMBAR Instruction,” and “Programming With the Memory Models,” in The SPARC Architecture Manual-Version 9 for more information.

D.3.4.1 MEMBAR #LoadLoad

All loads on SPARC M5 switch a strand out until the load completes. Thus, MEMBAR #LoadLoad is treated as a NOP on SPARC M5.

D.3.4.2 MEMBAR #StoreLoad

MEMBAR #StoreLoad forces all loads after the MEMBAR to wait until all stores before the MEMBAR have reached global visibility. MEMBAR #StoreLoad behaves the same as MEMBAR #Sync on SPARC M5.

D.3.4.3 MEMBAR #LoadStore

All loads on SPARC M5 switch a strand out until the load completes. Thus, MEMBAR #LoadStore is treated as a NOP on SPARC M5.

D.3.4.4 MEMBAR #StoreStore and STBAR

Stores on SPARC M5 maintain order in the store buffer. Thus, MEMBAR #StoreStore is treated as a NOP on SPARC M5.

D.3.4.5 MEMBAR #Lookaside

Loads and stores to noncacheable addresses are “self-synchronizing” on SPARC M5. Thus, MEMBAR #Lookaside is treated as a NOP on SPARC M5.

D.3.4.6 MEMBAR #MemIssue

MEMBAR #MemIssue forces all outstanding memory accesses to be completed before any memory access instruction after the MEMBAR is issued. It must be used to guarantee ordering of cacheable accesses following noncacheable accesses. For example, I/O accesses must be followed by a MEMBAR #MemIssue before subsequent cacheable stores; this ensures that the I/O accesses reach global visibility (as viewed by other strands) before the cacheable stores after the MEMBAR.

Notes STBAR has the same semantics as MEMBAR #StoreStore; it is included for SPARC-V8 compatibility.

SPARC M5 block stores and block-init stores are RMO. If a program needs to maintain order between RMO stores to different L2 cache lines, it should use a MEMBAR #Sync.

Note For SPARC V9 compatibility, this variation should be used before issuing a load to an address space that cannot be snooped.

Since loads are already self-synchronizing, MEMBAR #MemIssue just needs to drain the store buffer (and receive all the store ACKs) before allowing memory operations to issue again. This is the same operation as SPARC M5’s MEMBAR #Sync.

D.3.4.7 MEMBAR #Sync (Issue Barrier)

MEMBAR #Sync forces all outstanding instructions and all deferred errors to be completed before any instructions after the MEMBAR are issued.
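The behavior of the MEMBAR variations described in D.3.4.1 through D.3.4.7 can be summarized as a lookup. The sketch below is illustrative only: `M5_MEMBAR` and `effective_membar` are hypothetical names, and the rule that a combination of variants is as strong as its strongest member is an assumption, not a statement from this supplement.

```python
# Behavior of each MEMBAR variation on SPARC M5, per D.3.4.1 through D.3.4.7:
# "nop" means the variation is treated as a NOP, "sync" means it behaves
# the same as MEMBAR #Sync.
M5_MEMBAR = {
    "#LoadLoad": "nop",
    "#StoreLoad": "sync",
    "#LoadStore": "nop",
    "#StoreStore": "nop",
    "#Lookaside": "nop",
    "#MemIssue": "sync",
    "#Sync": "sync",
}


def effective_membar(*variants):
    """Collapse a combination of MEMBAR variations to "nop" or "sync".

    Assumption: a combined mask costs as much as its strongest variation.
    """
    if any(M5_MEMBAR[v] == "sync" for v in variants):
        return "sync"
    return "nop"
```

Under this reading, `effective_membar("#LoadLoad", "#LoadStore")` is a NOP, whereas adding `#StoreLoad` to any combination makes it as expensive as MEMBAR #Sync.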

D.3.4.8 Self-Modifying Code (FLUSH)

The SPARC V9 instruction set architecture does not guarantee consistency between code and data spaces. A problem arises when code space is dynamically modified by a program writing to memory locations containing instructions. Dynamic optimizers, LISP programs, and dynamic linking require this behavior. SPARC V9 provides the FLUSH instruction to synchronize instruction and data memory after code space has been modified.

In SPARC M5, FLUSH behaves like a store instruction for the purpose of memory ordering. In addition, all instruction fetch (or prefetch) buffers are invalidated. The issue of the FLUSH instruction is delayed until previous (cacheable) stores are completed. Instruction fetch (or prefetch) resumes at the instruction immediately after the FLUSH.

D.3.5 Atomic Operations

SPARC V9 provides three atomic instructions to support mutual exclusion. These instructions behave like both a load and a store but the operations are carried out indivisibly. Atomic instructions may be used only in the cacheable domain.

An atomic access with a restricted ASI in unprivileged mode (PSTATE.priv = 0) causes a privileged_action trap. An atomic access with a noncacheable address causes a data_access_exception trap (with SFSR.ft = 4, atomic to page marked noncacheable). An atomic access with an unsupported ASI causes a DAE_invalid_ASI trap. TABLE D-1 lists the ASIs that support atomic accesses.

Note MEMBAR #Sync is a costly instruction; unnecessary usage may result in substantial performance degradation.

TABLE D-1 ASIs That Support SWAP, LDSTUB, and CAS

ASI Name

ASI_NUCLEUS{_LITTLE}

ASI_AS_IF_USER_PRIMARY{_LITTLE}

ASI_AS_IF_USER_SECONDARY{_LITTLE}

ASI_PRIMARY{_LITTLE}

ASI_SECONDARY{_LITTLE}

ASI_REAL{_LITTLE}

Notes Atomic accesses with nonfaulting ASIs are not allowed, because these ASIs have the load-only attribute.

For all atomics, allocation is done to the L2 cache only and will invalidate the L1s.
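The trap conditions for atomic accesses and the ASI list in TABLE D-1 can be combined into one check. This is a sketch under stated assumptions: the function `atomic_access_trap` and its parameters are hypothetical, the caller must say whether the ASI is restricted (this section does not list the restricted ASIs), and the order in which the conditions are tested is an assumption because the text does not give trap priorities.

```python
# ASIs from TABLE D-1; each base name also has a _LITTLE variant.
_BASES = [
    "ASI_NUCLEUS",
    "ASI_AS_IF_USER_PRIMARY",
    "ASI_AS_IF_USER_SECONDARY",
    "ASI_PRIMARY",
    "ASI_SECONDARY",
    "ASI_REAL",
]
ATOMIC_ASIS = {b + suffix for b in _BASES for suffix in ("", "_LITTLE")}


def atomic_access_trap(asi, restricted, privileged, cacheable):
    """Return the trap an atomic access would take, or None if it proceeds.

    restricted: True if `asi` is a restricted ASI (supplied by the caller).
    privileged: True if PSTATE.priv = 1.
    cacheable:  True if the target address is cacheable.
    The order of the checks below is an assumption.
    """
    if restricted and not privileged:
        return "privileged_action"
    if asi not in ATOMIC_ASIS:
        return "DAE_invalid_ASI"
    if not cacheable:
        return "data_access_exception"
    return None
```

A nonfaulting ASI such as ASI_PRIMARY_NO_FAULT is absent from the table, so the sketch reports DAE_invalid_ASI for it, consistent with the note that nonfaulting ASIs are load-only.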


D.3.5.1 SWAP Instruction

SWAP atomically exchanges the lower 32 bits in an integer register with a word in memory. This instruction is issued only after store buffers are empty. Subsequent loads interlock on earlier SWAPs.

D.3.5.2 LDSTUB Instruction

LDSTUB behaves like SWAP, except that it loads a byte from memory into an integer register and atomically writes all 1’s (FF₁₆) into the addressed byte.

D.3.5.3 Compare and Swap (CASX) Instruction

Compare-and-swap combines a load, compare, and store into a single atomic instruction. It compares the value in an integer register to a value in memory; if they are equal, the value in memory is swapped with the contents of a second integer register. All of these operations are carried out atomically; in other words, no other memory operation may be applied to the addressed memory location until the entire compare-and-swap sequence is completed.
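The semantics of the three atomic instructions can be sketched in software. The model below is illustrative only: the class `Memory` and its methods are hypothetical, a single lock stands in for the hardware's indivisibility guarantee, and big-endian byte order is assumed. Each method returns the old memory value, which is what the destination register would receive.

```python
import threading


class Memory:
    """Byte-addressed toy memory; one lock stands in for hardware atomicity."""

    def __init__(self, size):
        self.bytes = bytearray(size)
        self._lock = threading.Lock()

    def read(self, addr, n):
        return int.from_bytes(self.bytes[addr:addr + n], "big")

    def _write(self, addr, n, value):
        self.bytes[addr:addr + n] = value.to_bytes(n, "big")

    def swap(self, addr, reg):
        """SWAP: exchange the low 32 bits of `reg` with the word at `addr`."""
        with self._lock:
            old = self.read(addr, 4)
            self._write(addr, 4, reg & 0xFFFFFFFF)
            return old

    def ldstub(self, addr):
        """LDSTUB: load the byte at `addr`, then write all 1's (0xFF)."""
        with self._lock:
            old = self.bytes[addr]
            self.bytes[addr] = 0xFF
            return old

    def casx(self, addr, compare, new):
        """CASX: store `new` only if the 64-bit value at `addr` equals
        `compare`; the old memory value is returned either way."""
        with self._lock:
            old = self.read(addr, 8)
            if old == compare:
                self._write(addr, 8, new & 0xFFFFFFFFFFFFFFFF)
            return old
```

A simple lock acquire built on this model would loop on `ldstub` until it returns 0, meaning the byte was clear before this caller set it.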

D.3.6 Nonfaulting Load

A nonfaulting load behaves like a normal load, except that:

■ It does not allow side-effect access. An access with the e bit set causes a DAE_so_page trap.

■ It can be applied to a page with the nfo bit set; other types of accesses will cause a DAE_NFO_page trap.

Nonfaulting loads are issued with ASI_PRIMARY_NO_FAULT{_LITTLE} or ASI_SECONDARY_NO_FAULT{_LITTLE}. A store with a NO_FAULT ASI causes a DAE_invalid_ASI trap.

When a nonfaulting load encounters a TLB miss, the operating system should attempt to translate the page. If the translation results in an error (for example, address out of range), a 0 is returned and the load completes silently.

Typically, optimizers use nonfaulting loads to move loads before conditional control structures that guard their use. This technique potentially increases the distance between a load of data and the first use of that data, to hide latency; it allows for more flexibility in code scheduling. It also allows for improved performance in certain algorithms by removing address checking from the critical code path.

For example, when following a linked list, nonfaulting loads allow the null pointer to be accessed safely in a read-ahead fashion if the operating system can ensure that the page at virtual address 0₁₆ is accessed with no penalty. The nfo (nonfault access only) bit in the MMU marks pages that are mapped for safe access by nonfaulting loads but can still cause a trap by other, normal accesses. This allows programmers to trap on wild pointer references (many programmers count on an exception being generated when accessing address 0₁₆ to debug code) while benefitting from the acceleration of nonfaulting access in debugged library routines.
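The rules for nonfaulting loads can be condensed into a small decision function. This is a sketch, not the hardware or operating-system implementation: the function `load`, the dictionary layout for a page, and the trap name used for a failed translation on a normal load are all hypothetical.

```python
def load(page, nonfaulting):
    """Model a load; returns ("ok", value) or ("trap", trap_name).

    `page` is None when the address cannot be translated, otherwise a dict
    holding the TTE bits "e" and "nfo" and the "value" at the address.
    """
    if page is None:
        if nonfaulting:
            return ("ok", 0)  # translation error: 0 is returned silently
        # Placeholder trap name; the source does not name this case.
        return ("trap", "translation_error")
    if nonfaulting and page["e"]:
        return ("trap", "DAE_so_page")  # no side-effect access allowed
    if not nonfaulting and page["nfo"]:
        return ("trap", "DAE_NFO_page")  # nfo pages accept only nonfaulting loads
    return ("ok", page["value"])
```

With page 0 mapped as an nfo page, a read-ahead through a null pointer using a nonfaulting load returns data, while a normal load of the same address still traps.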

D.4 L1 I-Cache

The L1 Instruction cache is 16 Kbytes, physically tagged and indexed, with 32-byte lines, and 8-way associative with pseudo-random replacement. The format used to index the cache is shown in TABLE D-2.

D.4.1 LFSR Replacement Algorithm

Details TBD.

D.4.2 Direct-Mapped Mode

The I-cache direct-mapped mode (see Section 20.9.1, ASI_DC_DIRECT_MAP_REG, on page 89) works by forcing all replacements to the “way” identified by bits [13:11] of the virtual address. Since lines already present are not affected but only new lines brought into the cache are affected, it is safe to turn on (or off) the direct-mapped mode at any time.

D.4.3 I-Cache Disable

Clearing the I-cache enable bit (see Section 5., ASI changes for SPARC VT core, on page 6) stops all accesses to the I-cache for that strand. All fetches will miss, and the returned data will not fill the I-cache. Invalidates will still be serviced while the I-cache is disabled.

D.5 L1 D-Cache

The L1 Data cache is 8 Kbytes, writethrough, physically tagged and indexed, with 16-byte lines, and 4-way associative with true LRU replacement. The format used to index the cache is shown in TABLE D-3.

D.5.1 LRU Replacement Algorithm

The D-cache replacement algorithm is true least-recently-used (LRU). Six bits are maintained for each cache index.
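The supplement states that six bits are kept per cache index but does not describe their encoding. One common scheme that needs exactly six bits for four ways keeps one bit per pair of ways, recording which of the two was used more recently. The sketch below shows that scheme as an assumption; it is not claimed to be the SPARC M5 encoding, and the names `PAIRS`, `touch`, and `lru_way` are hypothetical.

```python
from itertools import combinations

WAYS = 4
# One bit per unordered pair of ways: C(4, 2) = 6 bits per cache index.
PAIRS = list(combinations(range(WAYS), 2))


def touch(bits, way):
    """Record an access: `way` becomes more recent than every other way.

    bits[(i, j)] == 1 means way i was used more recently than way j.
    """
    for i, j in PAIRS:
        if i == way:
            bits[(i, j)] = 1
        elif j == way:
            bits[(i, j)] = 0
    return bits


def lru_way(bits):
    """Return the way that is older than all of the others."""
    for w in range(WAYS):
        if all(bits[(i, j)] == (0 if i == w else 1)
               for i, j in PAIRS if w in (i, j)):
            return w
    return None  # unreachable when bits are only changed through touch()
```

Starting from all bits clear (way 0 oldest), every call to `touch` keeps the six bits consistent with a total order of the four ways, so `lru_way` always finds exactly one victim.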

TABLE D-2 L1 Instruction Cache Addressing

Bits    Field   Description
39:11   tag     Tag for cache line.
10:5    set     Selects cache set containing the cache line.
4:2     instr   Selects 32-bit instruction in cache line.
1:0     -       Always 0 for access to 32-bit instructions.

TABLE D-3 L1 Data Cache Addressing

Bits    Field   Description
39:11   tag     Tag for cache line.
10:4    set     Selects cache set containing the cache line.
3:0     data    Selects data byte(s) in cache line.
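The field boundaries in TABLE D-2 and TABLE D-3 follow directly from the cache geometries given in D.4 and D.5. The sketch below derives them; the function `split_address` is a hypothetical helper, and the 40-bit address width is taken from the tag range 39:11 in the tables.

```python
def split_address(pa, cache_bytes, line_bytes, ways, pa_bits=40):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    set_bits = sets.bit_length() - 1            # log2 of the number of sets
    offset = pa & (line_bytes - 1)
    set_index = (pa >> offset_bits) & (sets - 1)
    tag_bits = pa_bits - offset_bits - set_bits
    tag = (pa >> (offset_bits + set_bits)) & ((1 << tag_bits) - 1)
    return tag, set_index, offset


def icache_fields(pa):
    """16 Kbytes, 32-byte lines, 8 ways: 64 sets, set index in bits 10:5."""
    return split_address(pa, 16 * 1024, 32, 8)


def dcache_fields(pa):
    """8 Kbytes, 16-byte lines, 4 ways: 128 sets, set index in bits 10:4."""
    return split_address(pa, 8 * 1024, 16, 4)
```

Both caches end up with the tag starting at bit 11, matching the 39:11 tag field in both tables. For the I-cache, the instruction index of TABLE D-2 is the byte offset shifted right by two.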


D.5.2 Direct-Mapped Mode

The D-cache direct-mapped mode (see Section 20.9.1, ASI_DC_DIRECT_MAP_REG, on page 89) works by changing the replacement algorithm from LRU to instead use two bits of index (address[12:11]) to select the “way.” Since lines already present are not affected but only new lines brought into the cache are affected, it is safe to turn on (or off) the direct-mapped mode at any time.

Note that if the D-cache is in direct-mapped mode, and a parity error occurs, the way replaced will be the way which experienced the parity error. This overrides the index selected by the address in direct-mapped mode.
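Way selection in direct-mapped mode, including the parity-error override, can be written out as follows. This is a minimal sketch: the function names and the `parity_error_way` parameter are hypothetical, and only the bit positions ([13:11] for the I-cache, [12:11] for the D-cache) come from the text.

```python
def icache_direct_mapped_way(addr):
    """I-cache: bits [13:11] of the address select one of 8 ways."""
    return (addr >> 11) & 0x7


def dcache_direct_mapped_way(addr, parity_error_way=None):
    """D-cache: bits [12:11] select one of 4 ways, unless a parity error
    occurred, in which case the way with the error is replaced instead."""
    if parity_error_way is not None:
        return parity_error_way
    return (addr >> 11) & 0x3
```

Because the way bits sit just above the set index and line offset, a contiguous read of 16 Kbytes (I-cache) or 8 Kbytes (D-cache) touches every set and way once in this mode.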

D.5.3 D-Cache Disable

The D-cache enable bit (see Section 5., ASI changes for SPARC VT core, on page 6) works by modifying the replacement algorithm, by forcing all D-cache misses to be nonallocating. Thus, dc = 0 has no effect if a line is already in the cache (it hits anyway), but only affects D-cache misses. Stores that hit in the L1 will be performed in the L2, then update the L1 (as normal).

To get the D-cache fully disabled, the dc bit must be off on all strands in the virtual processor, and the D-cache must be flushed in a way that doesn’t bring new lines back in. This can be done by storing (from a different core) to each line that is in the D-cache, or by displacement flushing the L2 cache so that inclusion will force all D-cache lines to be invalidated.

D.6 L2 Cache


APPENDIX E

Glossary

This chapter defines concepts and terminology unique to the SPARC M5 implementation. Definitions of terms common to all Oracle SPARC Architecture implementations may be found in the Definitions chapter of Oracle SPARC Architecture 2011.

ALU Arithmetic Logical Unit

architectural state Software-visible registers and memory (including caches).

ARF Architectural register file.

blocking ASI An ASI access that accesses its ASI register or array location once all older instructions in that strand have retired, no instructions in the other strand can issue, and the store queue, TSW, and LMB are all empty.

branch outcome A reference as to whether or not a branch instruction will alter the flow of execution from the sequential path. A taken branch outcome results in execution proceeding with the instruction at the branch target; a not-taken branch outcome results in execution proceeding with the instruction along the sequential path after the branch.

branch resolution A branch is said to be resolved when the result (that is, the branch outcome and branch target address) has been computed and is known for certain. Branch resolution can take place late in the pipeline.

branch target address The address of the instruction to be executed if the branch is taken.

commit An instruction commits when it modifies architectural state.

complex instruction A complex instruction is an instruction that requires the creation of secondary “helper” instructions for normal operation, excluding trap conditions such as spill/fill traps (which use helpers). Refer to Instruction Latency on page 97 for a complete list of all complex instructions and their helper sequences.

consistency See coherence.

CPU Central Processing Unit. A synonym for virtual processor.

CSR Control Status register.

FP Floating point.

L2C (or L2$) Level 2 cache.

leaf procedure A procedure that is a leaf in the program’s call graph; that is, one that does not call (by using CALL or JMPL) any other procedures.

nonblocking ASI A nonblocking ASI access will access its ASI register/array location once all older instructions in that strand have retired, and there are no instructions in the other strand which can issue.

older instruction Refers to the relative fetch order of instructions. Instruction i is older than instruction j if instruction i was fetched before instruction j. Data dependencies flow from older instructions to younger instructions, and an instruction can only be dependent upon older instructions.


one hot An n-bit binary signal is one hot if and only if n − 1 of the bits are each zero and a single bit is a 1.

quadlet

SIAM Set interval arithmetic mode instruction.

younger instruction See older instruction.

writeback The process of writing a dirty cache line back to memory before it is refilled.
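The one hot definition in this glossary can be checked with a standard bit trick. The sketch below is illustrative; the function name `is_one_hot` is hypothetical.

```python
def is_one_hot(value, n):
    """Return True if the n-bit `value` has exactly one bit set."""
    if not 0 <= value < (1 << n):
        raise ValueError("value does not fit in n bits")
    # Clearing the lowest set bit of a one-hot value leaves zero.
    return value != 0 and (value & (value - 1)) == 0
```

For instance, a 4-bit way-select vector of 0b0100 is one hot, while 0b0000 and 0b0110 are not.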


APPENDIX F

Bibliography

[contents of this appendix are TBD]


