Instruction prefetching in SMT (Simultaneous Multithreading) system and impact on the performance

Post on 05-Jan-2016

Page 1: Instruction prefetching in SMT(Simultaneous Multithreading) system and impact on the performance

Instruction prefetching in SMT (Simultaneous Multithreading) system and impact on the performance

by

Choi, Jun-Shik

Park, Joo Hyung

Page 2

Contents

1. Purpose

2. Background

3. Theory

4. Simulation

5. Results

6. Conclusion

Page 3

1. Purpose

• For speedup

– To take advantage of ILP (instruction level parallelism) and TLP (thread level parallelism), SMT is considered.

– To reduce the cache miss penalty and to use memory bandwidth efficiently, prefetching is used.

Page 4

2. Background

• Traditional Processor

• Out-of-order Execution

• Cache Prefetching

Page 5

Traditional Processor

• A traditional processor stalls for the whole memory latency, from the time a data miss happens until the data arrives.

[Timeline figure. Labels: L1 Miss, Stall, Memory Latency, Data Arrival]

Page 6

Out-of-order Execution

• Because data and control dependencies must be observed, the processor will still stall at some point if the memory latency is long.

[Timeline figure. Labels: L1 Miss, Independent Instr., Stall, Memory Latency, Data Arrival, Dependent Instr.]

Page 7

Cache Prefetching

• Cache prefetching overcomes this restriction by bringing data to the L1 cache or an on-chip buffer, avoiding as much of the cache miss penalty as possible.

[Timeline figure. Labels: Prefetch, Memory Latency, Data Arrival, L1 Miss, Dependent Instr.]

Page 8

3. Theory

• Simultaneous Multithreading (SMT)

– Plenty of resources

– Instruction level parallelism

– Thread level parallelism

• Markov prefetcher

Page 9

Prefetch Methods

• Stride Prefetcher

– For memory references separated by a constant stride

• Recursive Prefetcher

– Designed for access patterns that follow linked data structures

• Markov Prefetcher

– Based on the miss address history

• Etc…
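The stride method in the list above can be illustrated with a minimal sketch. The class name `StridePrefetcher`, the helper `stride_predictions`, and the rule "predict once the same non-zero stride has been seen twice in a row" are assumptions made for illustration; they are not taken from the slides or from the simulator used in this work.

```python
from typing import List, Optional


class StridePrefetcher:
    """Toy stride detector (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.last_addr: Optional[int] = None    # previous address seen
        self.last_stride: Optional[int] = None  # previous address delta

    def observe(self, addr: int) -> Optional[int]:
        """Record one reference; return an address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Assumption: predict only when the same non-zero stride repeats.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


def stride_predictions(addrs: List[int]) -> List[Optional[int]]:
    """Run a fresh detector over a reference stream."""
    prefetcher = StridePrefetcher()
    return [prefetcher.observe(a) for a in addrs]


# Constant stride of 64: predictions start at the third reference.
print(stride_predictions([100, 164, 228, 292]))
```

A stream with no repeating stride produces no predictions at all, which is the case the Markov method is meant to cover.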

Page 10

Basic Markov

<Example>

Address sequence: 1,2,3,4,3,5,1,3,6,6,5,1,1,2,3,4,5,1,2,3,4,3

History 1 (next address after 1):

1 → 1 (20%)

1 → 2 (60%)

1 → 3 (20%)

History 2 (next address after the last two addresses):

1, 1 → 2 (100%)

1, 2 → 3 (100%)

1, 3 → 6 (100%)
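The percentages in this example can be reproduced by counting transitions in the address sequence. The following is a small sketch; the function name `transition_probs` and the tuple-keyed dictionary layout are choices made for illustration, not part of the original slides.

```python
from collections import Counter, defaultdict

# Address sequence from the example.
MISSES = [1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3]


def transition_probs(seq, history=1):
    """Map each run of `history` addresses to the distribution of the next one."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - history):
        state = tuple(seq[i:i + history])
        counts[state][seq[i + history]] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }


p1 = transition_probs(MISSES, history=1)
p2 = transition_probs(MISSES, history=2)

print(p1[(1,)])    # successors of address 1
print(p2[(1, 2)])  # successors of the pair 1, 2
```

With one address of history, address 1 is followed by 2 in 60% of cases and by 1 or 3 in 20% each; with two addresses of history, every state that starts with 1 has a single successor.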

Page 11

The Address Sequence in Prefetch

• The miss address (IL1-cache miss) stream is used as the prediction source

• The CPU demand address stream has too high a bandwidth to be used directly

• The L1 cache filters the stream, so miss addresses occur less frequently

Page 12

Problem in Realizing Pure Markov Prediction

• Programs reference millions of addresses and it is impossible to record all references in a single table

Page 13

Prefetch Table

State (1-history)   Prediction 1   Prediction 2   Prediction 3   Prediction 4
1                   2              1              3              -
2                   3              -              -              -
3                   4              5              -              -
4                   3              5              -              -
5                   1              -              -              -
6                   5              6              -              -
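A table of this shape can be derived from an address sequence by ranking each state's successors. The sketch below uses the 22-address example sequence from the Basic Markov slide and assumes four prediction slots, ordering by descending frequency, and ties broken by the lower address; the function name `build_prefetch_table` and these ordering rules are assumptions, since the slides do not state how entries are ranked.

```python
from collections import Counter, defaultdict

# Example address sequence from the Basic Markov slide.
MISSES = [1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3]


def build_prefetch_table(misses, slots=4):
    """Return {state: up to `slots` predicted next addresses}, 1-history."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(misses, misses[1:]):
        counts[cur][nxt] += 1
    table = {}
    for state in sorted(counts):
        successors = counts[state]
        # Assumed ranking: most frequent first, lower address wins ties.
        ranked = sorted(successors, key=lambda a: (-successors[a], a))
        table[state] = ranked[:slots]
    return table


for state, predictions in build_prefetch_table(MISSES).items():
    row = predictions + ["-"] * (4 - len(predictions))
    print(state, *row)
```

Under these assumptions the rows for states 1, 2, 4, 5 and 6 match the table. A direct count over the sequence also yields 6 as a third successor of state 3, which the table does not list.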

Page 14

Prefetch Diagram

[Block diagram. Components: Prefetcher, Prefetch Table, Prefetch Buffer, L1 Cache, L2 Cache, Memory. Signals: Address request, Miss address]

Page 15

Prefetch Algorithm

[Flowchart. Boxes: CPU Address Request (L1 Miss); Look up table (Prefetch Table); Y (matched); N (not matched); Store Prefetch Data on L2; Examining cache look up; Data Transfer from L2 to CPU; Update or Insert Information to Prefetch Buffer]
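One possible reading of this flow can be sketched as a small software model driven by the L1 miss stream. Everything here is an assumption made for illustration: the class name `MarkovPrefetcher`, the order of the steps, the four prediction slots, the buffer size, and the oldest-first eviction. The sketch models only a prefetch table and a prefetch buffer; it does not model the L1 and L2 caches or the modified ss_smt simulator.

```python
from collections import Counter, OrderedDict, defaultdict


class MarkovPrefetcher:
    """Toy 1-history Markov prefetcher (hypothetical sketch)."""

    def __init__(self, slots=4, buffer_size=8):
        self.table = defaultdict(Counter)  # miss address -> successor counts
        self.buffer = OrderedDict()        # prefetch buffer, oldest first
        self.slots = slots
        self.buffer_size = buffer_size
        self.prev_miss = None
        self.buffer_hits = 0

    def on_l1_miss(self, addr):
        """Handle one L1 miss; return True if the prefetch buffer held it."""
        hit = addr in self.buffer
        if hit:
            self.buffer_hits += 1
            del self.buffer[addr]
        # Update or insert the transition previous miss -> current miss.
        if self.prev_miss is not None:
            self.table[self.prev_miss][addr] += 1
        self.prev_miss = addr
        # Table lookup: if the address matches a state, prefetch its successors.
        if addr in self.table:
            successors = self.table[addr]
            ranked = sorted(successors, key=lambda a: (-successors[a], a))
            for target in ranked[:self.slots]:
                self.buffer[target] = True
                self.buffer.move_to_end(target)
                if len(self.buffer) > self.buffer_size:
                    self.buffer.popitem(last=False)  # evict the oldest entry
        return hit


def count_buffer_hits(misses, slots=4, buffer_size=8):
    """Feed a miss stream through a fresh prefetcher and count buffer hits."""
    prefetcher = MarkovPrefetcher(slots, buffer_size)
    hits = [prefetcher.on_l1_miss(a) for a in misses]
    return sum(hits), hits
```

On a short stream the first misses can never hit, because the table is still empty. This matches the point that the prefetcher needs training time and enough L1 misses before it becomes useful.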

Page 16

4. Simulation

• Modified code: ss_smt-1.0

• Specification

– Threads: 2

– Cache: L1 (64KB), L2

– Number of instructions: 100 million

• Benchmarks used

– MCF (integer) and ART (floating point)

– GCC (integer) and MESA (floating point)

Page 17

5. Results

• Testbenches for the 2 threads

– MCF and ART

  • L1 miss rate = 0.0794, 0.0921

  • Number of L1 misses = number of accesses to PFB: 23, 7

– GCC and MESA

  • L1 miss rate = 0.0010, 0.0009

  • Number of L1 misses = number of accesses to PFB: 15, 13

Page 18

Benchmark Reference 1

• The following benchmarks grow quickly to their target sizes (expressed in megabytes) and then stay there:

– ART

– MCF

benchmark   max rsz   max vsz   num obs   num unchanged   stable?
art         3.7       4.3       157       37              x

Page 19

Benchmark Reference 2

• The following benchmarks change size over time:

– GCC

– MESA

benchmark   max rsz   max vsz   num obs   num unchanged   stable?
mesa        9.4       23.1      132       131             stable

Page 20

6. Conclusion

• A prefetcher using the Markov algorithm was simulated.

• To make the Markov prefetcher efficient in the system, it needs enough training time and enough L1 misses, because the prefetcher operates on the history of the L1 miss address sequence.

• Disadvantage of the Markov prefetcher

– High hardware cost; not a good stand-alone prefetcher