a coherent hybrid sram and stt-ram l1 cache architecture for shared memory...
TRANSCRIPT
-
A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicores
Jianxing Wang, Yenni Tim Zhong-Liang Ong and Weng-Fai Wong
School of Computing National University of Singapore
Zhenyu Sun and Hai (Helen) LiSwanson School of Engineering
University of Pittsburgh
1
-
Outline
• STT-RAM Basics• Cell structure and advantages• Challenges and motivations
• Hybrid L1 Cache Architecture• Naïve solution• The MESI protocol • Block transfer mechanisms
• Evaluation• Performance and energy• STT-RAM endurance
• Conclusion
2
-
STT-RAM Basics
• Magnetic Tunnel Junction (MTJ)• Two ferromagnetic layers separated by a barrier
3
-
STT-RAM Basics (cont.)
• Advantages• Non-volatile, near zero leakage energy• As fast as SRAM (read)• As dense as DRAM• Multi-level cell capability (stacking MTJs)• CMOS-compatible• Universal memory
4
-
Motivations of Hybrid Cache
• Expensive write operation of STT-RAM• High latency (10ns+)• High energy• Compensated by relaxed non-volatility [Smullen et al. 11]
• Refresh
• Endurance• Intense writes in L1
• bodytrack: L1(s) / L2 = ~29!• Additional synchronous operations under multi-core
environment 5
-
Proposed Hybrid Cache Hierarchy
6
SRAM
STT-R
AM
Core #0
L1 D-Cache Controller #0
L2 Cache Controller
L2 Cache (LLC)
MESI MESI
L1 D-Cache Controller #1
L1 D-Cache Controller #2
L1 D-Cache Controller #3
MESI MESI
SR
AM
STT-R
AM
Core #1
SRAM
STT-R
AM
Core #2
SRAM
STT-R
AM
Core #3
-
Cache Block Management
• Naïve solution• Based on temporal locality
• Simple but not good enough• > 3% IPC degradation
7
Cache Miss
STT-RAM
Write-miss
Read-miss
SRAM
-
The MESI Coherent Protocol
• Developed by University of Illinois• Illinois MESI
• For each cache block• M (modified) state – data dirty, exclusive copy• E (exclusive) state – data clean, exclusive copy• S (shared) state – data clean, multiple copies• I (invalid) state
• Common event bus• Local (processor) read/write• Remote (snoop / bus) read/write 8
-
Cache Block Management (cont.)
9
• Immediate transfer policy (IT)• Place dirty data (M state) block in SRAM• Place clean data (E/S state) block in STT-RAM• Transfer cache block when coherent state changes• DO NOT need extra information (built-in by MESI)
-
Immediate Transfer Policy (IT)
10
I
S I
Remote Read
Local Write
E
M
SRAM
STT-RAM
Write Miss
Read Miss
-
Cache Block Management (cont.)
11
• Delayed transfer policy (DT)• IT could be too aggressive
• Coherent state “ping-pong” between M and S
• Relax state restriction• Consider request history in prediction• Extra information required
-
Delayed Transfer Policy (DT)
12
M
E I
E
M
I S
S2nd Local Write (consecutive)
2nd Remote Read
1st Remote Read
1st Local Write
SRAM
STT-RAM
-
Evaluation
• PARSEC on MARSSx86 [Patel et al.11]• IPC (Instruction Per Cycle)
• NVSim [Dong et al. 12]• Latency, area and energy numbers (32nm)
• Configuration• Quadcore machine with two-level cache hierarchy • Relaxed STT-RAM’s non-volatility with a 26.5µs
retention period [Sun et al. 11]• Various cache size combinations within the
baseline area budget (64KB SRAM) 13
-
Normalized IPC (IT policy)
14
95
96
97
98
99
100
101
102(%)
4KB SRAM +64KB STT-RAM 8KB SRAM +64KB STT-RAM16KB SRAM +64KB STT-RAM 4KB SRAM +128KB STT-RAM
106.6
93 92
102.9102.1
-
Normalized Energy (IT policy)
15
50
55
60
65
70
75
80
85
90(%)
4KB SRAM + 64KB STT-RAM 8KB SRAM + 64KB STT-RAM16KB SRAM + 64KB STT-RAM 4KB SRAM + 128KB STT-RAM
-
Comparison of Transfer Policies
16
56
59
62
65
68
71
74
94
95
96
97
98
99
100
Pure 64KBSTT-RAM
4KB SRAM +64KB STT-RAM
8KB SRAM +64KB STT-RAM
16KB SRAM +64KB STT-RAM
4KB SRAM +128KB STT-RAM
Energy (%)IPC (%) Naïve IT DT Energy
×
× × ×
×
×
× ×
×
× ××× ×
-
Impact of Retention Time
• ∆• ∆: Thermal barrier of MTJ, affected by planar
area, thickness and temperature• Range from few microseconds to 10+ years
• Lower bound (DRAM-style refresh)• #
• Example: ~4µs(64-byte block, 64K size, 3-/9-
cycle read/write latency under 3GHz clock)
17
-
Impact of Retention Time (cont.)
18
100
105
110
115
120
125
90
92
94
96
98
100
102
4µs 8µs 16µs 32µs 64µs inf
Energy (%)IPC (%)
Retention Time
IPC Energy
-
STT-RAM Endurance
• Lifespan programming cycles• SRAM and DRAM: 10^16• STT-RAM prediction [Tabrizi 07] : 10^15• STT-RAM reported [Diao et al. 07] : 10^13• SLC NAND flash: 10^5
• Writes in L1 cache• High intensity• Non-even distributed
• bodytrack: ~35% writes on one cache partition• facesim: ~50% writes on the same cache partition, ~15%
on the same block!19
-
STT-RAM Endurance (cont.)
20
• facesimPerfect
distributedWorst
PartitionWorst Block
Baseline SRAM 1,300+ years 300+ years < 360 hrs
BaselineSTT-RAM 1.3 years 0.3 years < 22 mins
Hybrid Naïve 3.5 years 1.0 year 0.9 hrHybrid IT 41.2 years 6.9 years 51.6 hrsHybrid DT 32.9 years 7.0 years 54.3 hrs
150x lifespan increases for the worst block!
-
Conclusion
• Deploy STT-RAM as L1 cache• Expensive write (latency, energy and endurance)
• Architecture solution: hybrid cache• “big.LITTLE” model
• MESI-based Hybrid L1 Cache Architecture• Small SRAM partition + large STT-RAM partition • Using built-in information from coherent protocol• Performance maintained with less energy, and
extended lifespan21
-
Q&A 22