Instruction Prefetching in an SMT (Simultaneous Multithreading) System and Its Impact on Performance
by
Choi, Jun-Shik
Park, Joo Hyung
Contents
1. Purpose
2. Background
3. Theory
4. Simulation
5. Results
6. Conclusion
1. Purpose
• For speed-up
– SMT has been considered as a way to exploit both ILP (instruction-level parallelism) and TLP (thread-level parallelism).
– Prefetching has been used to reduce the cache-miss penalty and to use memory bandwidth efficiently.
2. Background
• Traditional Processor
• Out-of-order Execution
• Cache Prefetching
Traditional Processor
• A traditional processor stalls for the entire memory latency, from the time a data miss occurs until the data arrives.
[Timeline diagram: L1 miss → stall for the full memory latency → data arrival]
Out-of-order Execution
• Because data and control dependencies must be observed, the processor will still stall at some point if the memory latency is long.
[Timeline diagram: L1 miss → independent instructions keep executing → stall on dependent instructions until data arrival]
Cache Prefetching
• Cache prefetching overcomes this restriction by bringing data into the L1 cache or an on-chip buffer ahead of demand, avoiding as much of the cache-miss penalty as possible.
[Timeline diagram: prefetched data arrives before the L1 miss, so dependent instructions need not stall]
3. Theory
• Simultaneous Multithreading (SMT)
– Plenty of resources
– Instruction-level parallelism
– Thread-level parallelism
• Markov prefetcher
Prefetch Methods
• Stride Prefetcher
– Targets memory references separated by a constant stride
• Recursive Prefetcher
– Designed for linked data structures, following the pointer pattern
• Markov Prefetcher
– Based on the miss-address sequence
• Etc.
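The stride prefetcher listed above can be sketched as follows. This is an illustrative sketch, not the simulator's code: the per-PC table layout, the confidence rule (same stride seen twice in a row), and all names are assumptions.

```python
# Minimal stride-prefetcher sketch (illustrative, not the simulator's code).
# Each load PC gets a table entry tracking its last address and stride;
# once the same stride repeats, the next address is prefetched.

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confident)

    def access(self, pc, addr):
        """Record a memory access; return an address to prefetch, or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, False)
            return None
        last_addr, stride, _ = self.table[pc]
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            self.table[pc] = (addr, stride, True)
            return addr + stride  # stride confirmed: prefetch the next element
        self.table[pc] = (addr, new_stride, False)
        return None

pf = StridePrefetcher()
# A loop streaming through an array with a constant stride of 8 bytes:
for a in (0x100, 0x108, 0x110, 0x118):
    hint = pf.access(pc=0x400, addr=a)
print(hex(hint))  # 0x120, once the stride has repeated
```

Note that nothing is prefetched on the first two accesses: the prefetcher must first observe the stride and then see it confirmed, which is the training cost that the Conclusion slide alludes to.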
Basic Markov
<Example> Miss-address sequence:
1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3
[1-history transition diagram: from state 1 the next miss is 2 (60%), 1 (20%), or 3 (20%); from state 2 the next miss is always 3 (100%)]
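The transition percentages in the example can be reproduced by counting adjacent pairs in the miss sequence. The sketch below (not part of the original simulation) builds the 1-history Markov transition probabilities:

```python
from collections import Counter, defaultdict

# Build 1-history Markov transition probabilities from the example
# miss-address sequence on the slide.
seq = [1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3]

counts = defaultdict(Counter)
for cur, nxt in zip(seq, seq[1:]):
    counts[cur][nxt] += 1  # count each observed transition cur -> nxt

probs = {state: {nxt: c / sum(freq.values()) for nxt, c in freq.items()}
         for state, freq in counts.items()}

print(probs[1])  # {2: 0.6, 3: 0.2, 1: 0.2}
print(probs[2])  # {3: 1.0}
```

A 1-history predictor would therefore prefetch address 2 after a miss on address 1, and address 3 after a miss on address 2.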
The Address Sequence in Prefetch
• The miss-address (IL1-cache miss) stream is used as the prediction source.
• The full CPU reference stream would demand too much bandwidth.
• The L1 cache filters the reference stream, so the miss-address sequence occurs far less frequently.
Problem in Realizing Pure Markov Prediction
• Programs reference millions of addresses, so it is impossible to record every reference in a single table.
Prefetch Table (state = 1-history miss address)

  State | Prediction 1 | Prediction 2 | Prediction 3 | Prediction 4
    1   |      2       |      1       |      3       |      -
    2   |      3       |      -       |      -       |      -
    3   |      4       |      5       |      -       |      -
    4   |      3       |      5       |      -       |      -
    5   |      1       |      -       |      -       |      -
    6   |      5       |      6       |      -       |      -
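Capping each state at four predictions is what makes the table realizable in hardware, answering the "millions of addresses" problem above. The sketch below is an assumption about how such a bounded table could be maintained: the move-to-front ordering and drop-the-oldest eviction are guesses, so only some rows of the table above are reproduced exactly.

```python
# Bounded prefetch table: each state keeps at most MAX_PRED predicted
# next-miss addresses (4, matching the table above), capping the storage
# a full Markov matrix would require.
# The replacement policy (move-to-front, evict the oldest) is an assumption.
MAX_PRED = 4

class PrefetchTable:
    def __init__(self):
        self.table = {}  # state -> list of predicted next addresses (MRU first)

    def update(self, state, next_addr):
        preds = self.table.setdefault(state, [])
        if next_addr in preds:
            preds.remove(next_addr)   # move a repeated successor to the front
        elif len(preds) >= MAX_PRED:
            preds.pop()               # table full: evict the oldest prediction
        preds.insert(0, next_addr)

    def lookup(self, state):
        return self.table.get(state, [])

pt = PrefetchTable()
seq = [1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3]
for cur, nxt in zip(seq, seq[1:]):
    pt.update(cur, nxt)
print(pt.lookup(1))  # [2, 1, 3], matching the table row for state 1
```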
Prefetch Diagram
[Block diagram: CPU address requests go to the L1 cache; miss addresses are forwarded to the Prefetcher, which consults its Prefetch Table and fetches predicted data from the L2 cache / memory into the Prefetch Buffer]
Prefetch Algorithm
1. A CPU address request misses in the L1 cache.
2. The miss address is looked up in the prefetch table.
3. If matched (Y): the predicted data is fetched from L2 (after examining the cache) and stored in the prefetch buffer.
4. In either case (Y or N): the demanded data is transferred from L2 to the CPU, and the corresponding entry is updated or inserted in the prefetch table.
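The flow above can be sketched end to end. This is a simplified illustration under assumptions: the buffer and table are plain Python containers, hit handling in L1/L2 is elided, and the names are invented, not taken from ss_smt.

```python
# End-to-end sketch of the prefetch flow on an L1 miss (simplified;
# illustrative only, not the ss_smt implementation).

class Prefetcher:
    def __init__(self):
        self.table = {}        # prev miss addr -> predicted next miss addrs
        self.buffer = set()    # prefetch buffer contents
        self.prev_miss = None

    def on_l1_miss(self, addr):
        hit_buffer = addr in self.buffer  # was the demand data prefetched?
        if hit_buffer:
            self.buffer.discard(addr)
        # Matched table entry: fetch predicted data from L2 into the buffer.
        for pred in self.table.get(addr, []):
            self.buffer.add(pred)
        # Update/insert: record addr as a successor of the previous miss.
        if self.prev_miss is not None:
            preds = self.table.setdefault(self.prev_miss, [])
            if addr not in preds:
                preds.insert(0, addr)
        self.prev_miss = addr
        return hit_buffer

pf = Prefetcher()
misses = [1, 2, 3, 1, 2, 3]  # a repeating miss pattern
hits = [pf.on_l1_miss(a) for a in misses]
print(hits)  # [False, False, False, False, True, True]
```

The first pass over the pattern only trains the table; on the second pass the misses on 2 and 3 are found in the prefetch buffer. This mirrors the Conclusion slide's point that the Markov prefetcher needs enough training misses before it pays off.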
4. Simulation
• Modified code: ss_smt-1.0
• Specification
– Threads: 2
– Cache: L1 (64 KB), L2
– Number of instructions: 100 million
• Benchmarks used
– MCF (integer) and ART (floating point)
– GCC (integer) and MESA (floating point)
5. Results
• Test benches for the 2 threads
– MCF and ART
• L1 miss rates = 0.0794, 0.0921
• Number of L1 misses = number of accesses to the PFB: 23, 7
– GCC and MESA
• L1 miss rates = 0.0010, 0.0009
• Number of L1 misses = number of accesses to the PFB: 15, 13
Benchmark Reference 1
• The following benchmarks grow quickly to their target sizes (expressed in megabytes) and then stay there: ART, MCF

        max rsz | max vsz | num obs | num unchanged | stable?
  art     3.7       4.3       157         37            x

Benchmark Reference 2
• These change size over time: GCC, MESA

        max rsz | max vsz | num obs | num unchanged | stable?
  mesa    9.4      23.1       132        131          stable
6. Conclusion
• A prefetcher using the Markov algorithm has been simulated.
• For the Markov prefetcher to be efficient, the system must provide enough training time and enough L1 misses, because the prefetcher operates on the history of the L1 miss-address sequence.
• Disadvantages of the Markov prefetcher
– High hardware cost; not a good stand-alone prefetcher