Instruction Prefetching in an SMT (Simultaneous Multithreading) System and Its Impact on Performance
by
Choi, Jun-Shik
Park, Joo Hyung
Contents
1. Purpose
2. Background
3. Theory
4. Simulation
5. Results
6. Conclusion
1. Purpose
• For speed-up
– SMT has been considered as a way to exploit both ILP (instruction-level parallelism) and TLP (thread-level parallelism).
– Prefetching has been used to reduce the cache-miss penalty and to use memory bandwidth efficiently.
2. Background
• Traditional Processor
• Out-of-order Execution
• Cache Prefetching
Traditional Processor
• A traditional processor stalls for the entire memory latency, from the time a data miss occurs until the data arrives.
[Timeline diagram: L1 miss → stall for the full memory latency → data arrival]
Out-of-order Execution
• Because data and control dependencies must be observed, the processor will still stall at some point if the memory latency is long.
[Timeline diagram: L1 miss → independent instructions keep executing → stall on dependent instructions until data arrival]
Cache Prefetching
• Cache prefetching overcomes this restriction by bringing data into the L1 cache or an on-chip buffer ahead of demand, avoiding as much of the cache-miss penalty as possible.
[Timeline diagram: prefetched data arrives before the L1 miss, so dependent instructions need not stall]
3. Theory
• Simultaneous Multithreading (SMT)
– Plenty of resources
– Instruction-level parallelism
– Thread-level parallelism
• Markov prefetcher
Prefetch Methods
• Stride Prefetcher
– Targets memory references separated by a constant stride
• Recursive Prefetcher
– Designed for linked data structures, following the pointer pattern
• Markov Prefetcher
– Based on the miss-address sequence
• Etc.
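The stride prefetcher listed above can be sketched as follows. This is an illustrative sketch, not the simulator's code: the per-PC table layout, the confidence rule (same stride seen twice in a row), and all names are assumptions.

```python
# Minimal stride-prefetcher sketch (illustrative, not the simulator's code).
# Each load PC gets a table entry tracking its last address and stride;
# once the same stride repeats, the next address is prefetched.

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confident)

    def access(self, pc, addr):
        """Record a memory access; return an address to prefetch, or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, False)
            return None
        last_addr, stride, _ = self.table[pc]
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            self.table[pc] = (addr, stride, True)
            return addr + stride  # stride confirmed: prefetch the next element
        self.table[pc] = (addr, new_stride, False)
        return None

pf = StridePrefetcher()
# A loop streaming through an array with a constant stride of 8 bytes:
for a in (0x100, 0x108, 0x110, 0x118):
    hint = pf.access(pc=0x400, addr=a)
print(hex(hint))  # 0x120, once the stride has repeated
```

Note that nothing is prefetched on the first two accesses: the prefetcher must first observe the stride and then see it confirmed, which is the training cost that the Conclusion slide alludes to.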
Basic Markov
<Example> Miss-address sequence:
1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3
[1-history transition diagram: from state 1 the next miss is 2 (60%), 1 (20%), or 3 (20%); from state 2 the next miss is always 3 (100%)]
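The transition percentages in the example can be reproduced by counting adjacent pairs in the miss sequence. The sketch below (not part of the original simulation) builds the 1-history Markov transition probabilities:

```python
from collections import Counter, defaultdict

# Build 1-history Markov transition probabilities from the example
# miss-address sequence on the slide.
seq = [1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3]

counts = defaultdict(Counter)
for cur, nxt in zip(seq, seq[1:]):
    counts[cur][nxt] += 1  # count each observed transition cur -> nxt

probs = {state: {nxt: c / sum(freq.values()) for nxt, c in freq.items()}
         for state, freq in counts.items()}

print(probs[1])  # {2: 0.6, 3: 0.2, 1: 0.2}
print(probs[2])  # {3: 1.0}
```

A 1-history predictor would therefore prefetch address 2 after a miss on address 1, and address 3 after a miss on address 2.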
The Address Sequence in Prefetch
• The miss-address (IL1-cache miss) stream is used as the prediction source.
• The full CPU reference stream would demand too much bandwidth.
• The L1 cache filters the reference stream, so the miss-address sequence occurs far less frequently.
Problem in Realizing Pure Markov Prediction
• Programs reference millions of addresses, so it is impossible to record every reference in a single table.
Prefetch Table (state = 1-history miss address)

  State | Prediction 1 | Prediction 2 | Prediction 3 | Prediction 4
    1   |      2       |      1       |      3       |      -
    2   |      3       |      -       |      -       |      -
    3   |      4       |      5       |      -       |      -
    4   |      3       |      5       |      -       |      -
    5   |      1       |      -       |      -       |      -
    6   |      5       |      6       |      -       |      -
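Capping each state at four predictions is what makes the table realizable in hardware, answering the "millions of addresses" problem above. The sketch below is an assumption about how such a bounded table could be maintained: the move-to-front ordering and drop-the-oldest eviction are guesses, so only some rows of the table above are reproduced exactly.

```python
# Bounded prefetch table: each state keeps at most MAX_PRED predicted
# next-miss addresses (4, matching the table above), capping the storage
# a full Markov matrix would require.
# The replacement policy (move-to-front, evict the oldest) is an assumption.
MAX_PRED = 4

class PrefetchTable:
    def __init__(self):
        self.table = {}  # state -> list of predicted next addresses (MRU first)

    def update(self, state, next_addr):
        preds = self.table.setdefault(state, [])
        if next_addr in preds:
            preds.remove(next_addr)   # move a repeated successor to the front
        elif len(preds) >= MAX_PRED:
            preds.pop()               # table full: evict the oldest prediction
        preds.insert(0, next_addr)

    def lookup(self, state):
        return self.table.get(state, [])

pt = PrefetchTable()
seq = [1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3]
for cur, nxt in zip(seq, seq[1:]):
    pt.update(cur, nxt)
print(pt.lookup(1))  # [2, 1, 3], matching the table row for state 1
```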
Prefetch Diagram
[Block diagram: CPU address requests go to the L1 cache; miss addresses are forwarded to the Prefetcher, which consults its Prefetch Table and fetches predicted data from the L2 cache / memory into the Prefetch Buffer]
Prefetch Algorithm
1. A CPU address request misses in the L1 cache.
2. The miss address is looked up in the prefetch table.
3. If matched (Y): the predicted data is fetched from L2 (after examining the cache) and stored in the prefetch buffer.
4. In either case (Y or N): the demanded data is transferred from L2 to the CPU, and the corresponding entry is updated or inserted in the prefetch table.
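The flow above can be sketched end to end. This is a simplified illustration under assumptions: the buffer and table are plain Python containers, hit handling in L1/L2 is elided, and the names are invented, not taken from ss_smt.

```python
# End-to-end sketch of the prefetch flow on an L1 miss (simplified;
# illustrative only, not the ss_smt implementation).

class Prefetcher:
    def __init__(self):
        self.table = {}        # prev miss addr -> predicted next miss addrs
        self.buffer = set()    # prefetch buffer contents
        self.prev_miss = None

    def on_l1_miss(self, addr):
        hit_buffer = addr in self.buffer  # was the demand data prefetched?
        if hit_buffer:
            self.buffer.discard(addr)
        # Matched table entry: fetch predicted data from L2 into the buffer.
        for pred in self.table.get(addr, []):
            self.buffer.add(pred)
        # Update/insert: record addr as a successor of the previous miss.
        if self.prev_miss is not None:
            preds = self.table.setdefault(self.prev_miss, [])
            if addr not in preds:
                preds.insert(0, addr)
        self.prev_miss = addr
        return hit_buffer

pf = Prefetcher()
misses = [1, 2, 3, 1, 2, 3]  # a repeating miss pattern
hits = [pf.on_l1_miss(a) for a in misses]
print(hits)  # [False, False, False, False, True, True]
```

The first pass over the pattern only trains the table; on the second pass the misses on 2 and 3 are found in the prefetch buffer. This mirrors the Conclusion slide's point that the Markov prefetcher needs enough training misses before it pays off.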
4. Simulation
• Modified code: ss_smt-1.0
• Specification
– Threads: 2
– Cache: L1 (64 KB), L2
– Number of instructions: 100 million
• Benchmarks used
– MCF (integer) and ART (floating point)
– GCC (integer) and MESA (floating point)
5. Results
• Test benches for the 2 threads
– MCF and ART
• L1 miss rates = 0.0794, 0.0921
• Number of L1 misses = number of accesses to the PFB: 23, 7
– GCC and MESA
• L1 miss rates = 0.0010, 0.0009
• Number of L1 misses = number of accesses to the PFB: 15, 13
Benchmark Reference 1
• The following benchmarks grow quickly to their target sizes (expressed in megabytes) and then stay there: ART, MCF

        max rsz | max vsz | num obs | num unchanged | stable?
  art     3.7       4.3       157         37            x

Benchmark Reference 2
• These change size over time: GCC, MESA

        max rsz | max vsz | num obs | num unchanged | stable?
  mesa    9.4      23.1       132        131          stable
6. Conclusion
• A prefetcher using the Markov algorithm has been simulated.
• For the Markov prefetcher to be efficient, the system must provide enough training time and enough L1 misses, because the prefetcher operates on the history of the L1 miss-address sequence.
• Disadvantages of the Markov prefetcher
– High hardware cost; not a good stand-alone prefetcher