![Page 1: 1 DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems Magnus Jahre †, Marius Grannaes † ‡ and Lasse Natvig † † Norwegian](https://reader038.vdocuments.net/reader038/viewer/2022103123/56649d1f5503460f949f24ba/html5/thumbnails/1.jpg)
DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems
Magnus Jahre†, Marius Grannaes†‡ and Lasse Natvig†
† Norwegian University of Science and Technology
‡ Energy Micro
Chip Multiprocessor Resources
• Hardware-controlled, shared resources
– Interconnect bandwidth
– Shared cache capacity
– Memory bus bandwidth
– Memory capacity is allocated by the operating system
Interference can occur in all shared units
[Figure: CPU 1 to CPU 4, each with a private I-Cache and D-Cache (Private Memory System), connected through the Interconnect to the Shared Cache, Memory Controller, Memory Bus and Main Memory (Shared Memory System)]
Current CMP implementations do not take interference into account
Why Control Resource Allocation?
Provide predictable performance
Support OS scheduler assumptions
Cloud: Fulfill Service Level Agreement
Resource Allocation Tasks
Measurement
Allocation (Policy)
Enforcement (Mechanism)
Focus of this work
Resource Allocation Baselines
Baseline = Interference-free configuration
Quantify performance impact from interference
Private Mode and Shared Mode
Multi-Programmed Baseline
• All processes in a workload run concurrently
• Static and equal partitioning of all shared resources
[Figure: Multiprogrammed Baseline. The Shared Cache and the Memory Bus are each divided 50% to Program A and 50% to Program B]
Single Program Baseline
• The process is run alone in one core
• All other cores are idle
• Exclusive access to all shared resources
[Figure: Single Program Baseline. Program A is shown with 100% of the Shared Cache and the Memory Bus; Program B is shown separately, also with 100% of both]
Baseline Weaknesses
• Multiprogrammed Baseline
– Only accounts for interference in partitioned resources
– Static and equal division of DRAM bandwidth does not give equal latency
– Complex relationship between resource allocation and performance
• Single Program Baseline
– Does not exist in shared mode
Dynamic Interference Estimation Framework (DIEF)
Outline
• Introduction
• Dynamic Interference Estimation Framework
– Shared Cache
– Memory Bus
– On-chip interconnect
• Results
• Summary
Interference Estimation
Full-System Interference Estimation: Aggregate interference from different units
Common unit of measure: Average Latency (Clock Cycles)
DIEF: General, component-based framework
Interference Definition
[Figure: latency diagram with the labels Interference, Private Mode Latency, Estimate Error, Private Mode Latency Measurement, Shared Mode Latency and Private Mode Latency Estimate]
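Read as arithmetic, the labels in this figure suggest two differences. The sketch below assumes that interference is the shared mode latency minus the private mode latency, that per-unit interference estimates can simply be summed because they share one unit (average latency in clock cycles), and that the estimate error is the private mode latency estimate minus the private mode latency measurement. The function names, unit names and numbers are illustrative and do not come from the slides.

```python
def interference(shared_latency: float, private_latency: float) -> float:
    """Extra average latency (clock cycles) attributed to sharing."""
    return shared_latency - private_latency


def private_latency_estimate(shared_latency: float, unit_interference: dict) -> float:
    """Subtract the summed per-unit interference estimates from the
    measured shared mode latency (assumed aggregation rule)."""
    return shared_latency - sum(unit_interference.values())


def estimate_error(estimate: float, measured_private: float) -> float:
    """Difference between the estimate and a private mode measurement."""
    return estimate - measured_private


# Hypothetical per-unit interference estimates, in clock cycles.
per_unit = {"interconnect": 4.0, "shared_cache": 30.0, "memory_bus": 46.0}
est = private_latency_estimate(200.0, per_unit)  # 200 - 80 = 120 cycles
err = estimate_error(est, 115.0)                 # 120 - 115 = 5 cycles
```

Keeping every unit in clock cycles is what makes the sum meaningful; a unit that reported misses or bandwidth would first need converting to latency.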
Shared Cache Interference
[Figure: CPU 0 and CPU 1 issue cache accesses (blocks A, B, M, N) to the Shared Cache; Auxiliary Tag Directories are shown alongside the Shared Cache]
Shared Cache Interference
[Figure: the same Shared Cache and Auxiliary Tag Directories after an access to a new block C; blocks A, B, M, N and C are shown]
Eviction may not be interference
Shared Cache Interference
[Figure: the same Shared Cache and Auxiliary Tag Directories after further accesses; one lookup is labeled Hit and the other Miss]
Interference cost = miss penalty
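One way to read the three shared cache slides is that an access counts as interference when it misses in the shared cache but would have hit had the core owned the cache alone. The sketch below assumes a per-core, tag-only LRU model of a single cache set; the class name, the `ways` parameter and the 120-cycle miss penalty are hypothetical and not taken from the slides.

```python
from collections import OrderedDict


class AuxiliaryTagDirectory:
    """Tag-only LRU model of one cache set as one core would see it alone."""

    def __init__(self, ways: int):
        self.ways = ways
        self.tags = OrderedDict()  # least recently used tag first

    def access(self, tag) -> bool:
        """Record an access; return True if it would hit in private mode."""
        hit = tag in self.tags
        if hit:
            self.tags.move_to_end(tag)  # mark as most recently used
        else:
            if len(self.tags) >= self.ways:
                self.tags.popitem(last=False)  # evict the LRU tag
            self.tags[tag] = None
        return hit


def interference_cost(atd, tag, shared_cache_hit: bool, miss_penalty: int) -> int:
    """Charge the miss penalty only for a private mode hit that missed
    in the shared cache; every other combination costs nothing."""
    private_hit = atd.access(tag)
    return miss_penalty if private_hit and not shared_cache_hit else 0


atd = AuxiliaryTagDirectory(ways=4)
costs = [
    interference_cost(atd, "A", False, 120),  # cold miss in both
    interference_cost(atd, "B", False, 120),  # cold miss in both
    interference_cost(atd, "A", False, 120),  # private hit, shared miss
    interference_cost(atd, "B", True, 120),   # hit in both
]
```

Under this reading an eviction by another core is not charged by itself; a cost is recorded only when the owner touches the block again while the tag is still present in its own directory.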
Bus Interference Requirements
• Out-of-order memory bus scheduling
• Shared mode only cache misses and cache hits
• Shared cache writebacks
Computing private latency based on shared mode queue contents is difficult
Emulate private scheduling in the shared mode
[Figure: Memory Latency Estimation Buffer. A Shared Bus Queue feeds a buffer that tracks requests in Arrival Order and Execution Order, with a Head Pointer, a Latency Lookup Table and per-bank Open Page Emulation Registers (Bank 0, Bank 1, ...). The latency values 120, 200, 40 and 40 and the page numbers 15 and 32 appear in the figure.]
Bank/Page Mapping: A → (0,15), B → (0,19), C → (0,15), D → (1,32)
Estimated Queue Latency = 120 + 40 + 40
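The idea of emulating private scheduling with open page registers and a latency lookup table can be sketched as follows. This is a simplified illustration, not the mechanism from the slides: the mapping of the values 40, 120 and 200 to page hit, closed bank and page conflict is an assumption, the execution order is supplied by hand instead of being derived by an out-of-order scheduler, and the result is not meant to reproduce the sum shown in the figure.

```python
# Assumed latency lookup table in clock cycles. The original figure shows
# the values 40, 120 and 200; which case each one belongs to is a guess.
LATENCY = {"page_hit": 40, "bank_closed": 120, "page_conflict": 200}


def emulate_private_latencies(execution_order, mapping):
    """Return per-request service latency, tracking one open page per bank."""
    open_page = {}  # open page emulation registers: bank -> open page
    latencies = {}
    for req in execution_order:
        bank, page = mapping[req]
        if bank not in open_page:
            case = "bank_closed"
        elif open_page[bank] == page:
            case = "page_hit"
        else:
            case = "page_conflict"
        latencies[req] = LATENCY[case]
        open_page[bank] = page  # the accessed page is now open
    return latencies


def queue_latency(execution_order, latencies, req):
    """Sum the service latencies of all requests executed before `req`."""
    idx = execution_order.index(req)
    return sum(latencies[r] for r in execution_order[:idx])


# Bank/page mapping from the slide; the execution order is hypothetical.
mapping = {"A": (0, 15), "B": (0, 19), "C": (0, 15), "D": (1, 32)}
order = ["A", "C", "B", "D"]
lat = emulate_private_latencies(order, mapping)
wait_for_d = queue_latency(order, lat, "D")
```

Grouping A and C, which map to the same page of bank 0, lets the second of the two be served as a page hit; this is the kind of reordering that makes private mode latency hard to compute from the shared mode queue contents alone.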
Interconnect Interference
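[Figure: CPU 0 and CPU 1 send requests (A, B, C and E, F) through the interconnect to L2 Bank 0 and L2 Bank 1; per-CPU Interference Counters are updated as requests are delayed]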
CPU 1 delays CPU 0
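A minimal sketch of per-CPU interference counters, assuming a single L2 bank port that is arbitrated round-robin and occupied for a fixed number of cycles per request. The arbitration policy, the 4-cycle service time and the function name are assumptions made for illustration; the slides do not specify them.

```python
from collections import deque


def count_interference(pending, service_cycles=4):
    """Serve one bank port round-robin and count cross-CPU delay.

    pending: dict mapping cpu id -> list of requests, all ready at cycle 0.
    Returns a dict mapping cpu id -> interference counter in clock cycles.
    """
    queues = {cpu: deque(reqs) for cpu, reqs in pending.items()}
    counters = {cpu: 0 for cpu in queues}
    cpus = sorted(queues)
    turn = 0
    while any(queues.values()):
        cpu = cpus[turn % len(cpus)]
        turn += 1
        if not queues[cpu]:
            continue  # nothing to send, pass the turn on
        queues[cpu].popleft()  # this CPU's request occupies the port
        for other in cpus:
            # Only waiting behind a different CPU's request is interference.
            if other != cpu and queues[other]:
                counters[other] += service_cycles
    return counters


counters = count_interference({0: ["A", "B"], 1: ["E"]})
```

Time a request spends waiting behind earlier requests from its own CPU is deliberately not counted, since that delay would also occur in private mode.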
Relative Estimation Errors
[Figure: Average Relative Error (y-axis, -4 % to 8 %) for the 1C, 2C and 4C configurations on 4, 8 and 16 cores, shown separately for the Crossbar and Ring interconnects]
RMS Error Breakdown
[Figure: Sum of Average Per-Benchmark Per-Unit RMS Error in clock cycles (y-axis, 0 to 100) for the 1C, 2C and 4C configurations on 4, 8 and 16 cores, shown separately for the Crossbar and Ring interconnects. Legend: Bus Queue, Bus Service, Interconnect Request Queue]
Remaining units contribute less than 2 clock cycles
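The slides do not spell out how the two error metrics are computed. The sketch below assumes the standard definitions of relative error and root mean square (RMS) error; the sample values are made up to show why both are worth reporting.

```python
import math


def relative_error(estimate: float, actual: float) -> float:
    """Signed error as a fraction of the actual value."""
    return (estimate - actual) / actual


def rms_error(estimates, actuals) -> float:
    """Root mean square of the absolute differences, in the input unit."""
    squared = [(e - a) ** 2 for e, a in zip(estimates, actuals)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical latencies in clock cycles: one overestimate, one underestimate.
estimates = [103.0, 97.0]
actuals = [100.0, 100.0]
mean_rel = sum(relative_error(e, a) for e, a in zip(estimates, actuals)) / 2
rms = rms_error(estimates, actuals)
```

With these values the average relative error is zero because the two errors cancel, while the RMS error is 3 clock cycles, so the RMS figure exposes error magnitude that a signed average can hide.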
Auxiliary Tag Directory Accuracy
[Figure: Relative Miss Estimate Error (y-axis, -2 % to 2 %) for the 1C, 2C and 4C configurations on 4, 8 and 16 cores, shown separately for the Crossbar and Ring interconnects]
Summary
• Memory system interference causes unpredictable performance
• DIEF provides
– Accurate private mode latency estimates
– Accurate shared mode latency measurements
• Future opportunities
– Guiding dynamic optimizations
– Guiding OS scheduling decisions
– Debugging and optimization
Thank you!
Visit our website: http://research.idi.ntnu.no/multicore/
Questions?
Experiment Methodology
• M5 simulator
– Extended with crossbar and ring on-chip interconnect models
– DDR2 memory bus model
• Randomly generated workloads of SPEC2000 benchmarks
– 40 4-core workloads
– 20 8-core workloads
– 10 16-core workloads