TRANSCRIPT
Applying Control Theory to the Caches of Multiprocessors
Department of EECS, University of Tennessee, Knoxville
Kai Ma
Applying Control Theory to the Caches of Multiprocessors
The shared L2 cache is one of the most important on-chip shared resources:
- Largest area and leakage power consumer
- One of the dominant factors in overall performance

Two papers:
1. Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors
2. SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors
Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors
Department of EECS, University of Tennessee, Knoxville
Xiaorui Wang, Kai Ma, Yefu Wang
Background
NUCA (Non-Uniform Cache Architecture)
Key idea: different cache banks have different access latencies.
Introduction
- The power of the cache must be constrained.
- Under a power constraint, the performance of the caches must still be guaranteed.
- Why control relative latency (the ratio between the average cache access latencies of two threads)?
  1. Accelerate critical threads
  2. Reduce priority inversion
System Design
[Figure: two nested control loops over the shared L2 cache. Each of the four threads (on cores 0-3) has a Latency Monitor feeding a Relative Latency Controller, which drives the Cache Resizing and Partitioning Modulator; a Power Monitor feeds the Power Controller. The relative latency control loop sits inside the power control loop. The L2 cache banks are partitioned among the four threads, with some banks inactive.]
Relative Latency Controller (RLC)
- PI (Proportional-Integral) controller: system modeling, controller design, control analysis
- Feedback loop: the relative latency set point (e.g., 1.5) is compared with the measured relative latency of the shared L2 caches (e.g., 1.2); the resulting error (0.3) makes the RLC increase the cache ratio (e.g., by 0.2)
- Disturbances: workload variation and total cache size variation
Relative Latency Model

RL model:

  rl_i(k) = sum_{j=1..n1} a_j * rl_i(k-j) + sum_{j=1..n2} b_j * c_i(k-j)

where rl_i(k) is the relative latency between the i-th and (i+1)-th core, and c_i is the cache size ratio between the i-th and (i+1)-th core.

System identification determines the model orders n1, n2 and the parameters a_j, b_j.

Model orders and error:

           n1 = 0   n1 = 1   n1 = 2
  n2 = 1    0.25     0.17     0.17
  n2 = 2    0.22     0.17     0.17
  n2 = 3    0.18     0.15     0.15
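As a sketch of the system identification step, the first-order case of the model above can be fit by least squares. The data is synthetic and `identify_first_order` is an illustrative helper, not the paper's tool:

```python
# Least-squares identification of the first-order relative latency model
#   rl(k) = a1 * rl(k-1) + b1 * c(k-1)
# Synthetic data stands in for measurements from the latency monitors.

def identify_first_order(rl, c):
    """Fit a1, b1 by solving the 2x2 normal equations of least squares."""
    # Regressors x(k) = [rl(k-1), c(k-1)], target y(k) = rl(k)
    s11 = s12 = s22 = t1 = t2 = 0.0
    for k in range(1, len(rl)):
        x1, x2, y = rl[k - 1], c[k - 1], rl[k]
        s11 += x1 * x1; s12 += x1 * x2; s22 += x2 * x2
        t1 += x1 * y;  t2 += x2 * y
    det = s11 * s22 - s12 * s12
    a1 = (t1 * s22 - t2 * s12) / det
    b1 = (t2 * s11 - t1 * s12) / det
    return a1, b1

# Generate data from known parameters, then recover them.
true_a1, true_b1 = 0.8, 0.5
c = [0.5, 1.0, 0.7, 1.2, 0.9, 1.1, 0.6, 1.3]   # cache size ratios
rl = [1.0]
for k in range(1, len(c)):
    rl.append(true_a1 * rl[k - 1] + true_b1 * c[k - 1])

a1, b1 = identify_first_order(rl, c)
print(round(a1, 3), round(b1, 3))   # recovers 0.8 and 0.5
```

Because the synthetic data follows the model exactly, the fit recovers the true parameters; on real latency traces the residual picks the model order, as in the table above.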
Controller Design
- PI (Proportional-Integral) controller: proportional term K_P * e(k), integral term K_I * sum of e(k)
- Design method: root locus
- Loop: the error e(k) between the relative latency set point and the measured relative latency of the shared L2 caches drives a new cache ratio:

  c_i(k) = c_i(k-1) + K1 * e(k) - K2 * e(k-1)
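A minimal closed-loop sketch of this incremental PI law, with an assumed first-order plant; none of the numbers below are the paper's identified values:

```python
# Incremental PI control of the relative latency:
#   c(k) = c(k-1) + K1*e(k) - K2*e(k-1),  e(k) = set_point - rl(k)
# Assumed first-order plant: rl(k) = a1*rl(k-1) + b1*c(k-1).

a1, b1 = 0.8, 0.5            # illustrative plant parameters
K1, K2 = 0.6, 0.3            # illustrative PI gains
set_point = 1.5

rl, c, e_prev = 1.2, 0.0, 0.0
for k in range(50):
    e = set_point - rl
    c = c + K1 * e - K2 * e_prev    # new cache size ratio
    e_prev = e
    rl = a1 * rl + b1 * c           # plant response in the next period

print(round(rl, 3))   # settles at the 1.5 set point
```

The integral action drives the steady-state error to zero, so the relative latency converges to the set point even though the plant gain is not known exactly.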
Control Analysis
- Derive the transfer function of the controller
- Derive the transfer function of the system with system model variations
- Derive the transfer function of the closed-loop system and compute its poles
- The control period of the power control loop is selected to be longer than the settling time of the relative latency control loop

With model variation, the plant becomes

  rl_i(k) = a1' * rl_i(k-1) + b1 * c_i(k-1)

Stability range: 0.69 <= a1' <= 1.18
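The pole computation can be sketched in a few lines. Assuming a first-order plant rl(k) = a1'*rl(k-1) + b1*c(k-1) and the incremental PI law c(k) = c(k-1) + K1*e(k) - K2*e(k-1), the closed-loop characteristic polynomial is z^2 + (b1*K1 - 1 - a1')*z + (a1' - b1*K2); all gains and parameters below are illustrative:

```python
import cmath

def closedloop_stable(a1, b1, K1, K2):
    """Check |poles| < 1 for z^2 + p*z + q = 0, the closed-loop
    characteristic polynomial of the first-order plant with the
    incremental PI controller."""
    p = b1 * K1 - 1.0 - a1
    q = a1 - b1 * K2
    d = cmath.sqrt(p * p - 4.0 * q)           # works for negative discriminants
    roots = ((-p + d) / 2.0, (-p - d) / 2.0)
    return all(abs(z) < 1.0 for z in roots)

# A nominal design stays stable while the plant parameter a1' varies
# inside some range; a large enough a1' pushes a pole outside the
# unit circle (the values here are illustrative, not the paper's).
print(closedloop_stable(0.8, 0.5, 0.6, 0.3))   # True: stable
print(closedloop_stable(1.9, 0.5, 0.6, 0.3))   # False: unstable
```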
Power Controller

System model: leakage power is proportional to the cache size, and leakage accounts for the largest portion of cache power, so

  p(k) = c * s(k) + d

where p(k) is the cache power and s(k) the total cache size in the k-th power control period, and c and d are parameters that depend on the applications.

PI controller. Controller analysis: stable for model variation 0 < c' < 0.76.
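A sketch of the power loop over the linear leakage model above; the model coefficients, budget, and gains are illustrative assumptions, not values from the paper:

```python
# PI power control over the linear leakage model p(k) = c*s(k) + d.
# The power controller adjusts the total active cache size s to track
# a power budget; all numbers below are made up for illustration.

c_model, d_model = 0.02, 1.0     # W per KB of active cache, static offset
budget = 3.0                     # power budget in W

Kp, Ki = 10.0, 20.0              # illustrative PI gains
s, integral = 150.0, 0.0         # initial active cache size in KB
for k in range(100):
    p = c_model * s + d_model    # power reading for this period
    e = budget - p
    integral += e
    s = max(0.0, Kp * e + Ki * integral)   # cache size cannot go negative

print(round(c_model * s + d_model, 3))   # power converges to the 3.0 W budget
```

Because leakage dominates and scales with size, resizing the cache is an effective power actuator; the integral term removes the steady-state error against the budget.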
Simulation
- Simulator: SimpleScalar with a NUCA cache (Alpha 21264-like core)
- Power readings: dynamic part from Wattch (with CACTI); leakage part from HotLeakage
- Workload: selected workloads from SPEC2000
- Actuator: cache bank resizing and partitioning
[Figure: four snapshots of the 16-bank L2 cache layout, illustrating bank resizing and repartitioning among the threads]
Single Control Evaluation
Experiments (workloads are switched mid-run):
- RLC set point change
- Power controller set point change
- Workload switch
- Total cache bank count change
Relative Latency & IPC
Coordination
Cache access latencies and IPC values of the four threads on the four cores of the CMP.
Cache access latencies and IPC values of the two threads on Core 0 and Core 1 for different benchmarks.
Conclusions
Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors:
- Simultaneously controls power and relative latency
- Achieves the desired performance differentiations
- Theoretically analyzes single-loop control and coordinated-system stability
SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors
Shekhar Srikantaiah, Mahmut Kandemir, *Qian Wang
Department of CSE
*Department of MNE
The Pennsylvania State University
Introduction
Lack of control over shared on-chip resources leads to:
- Faded performance isolation
- No Quality of Service (QoS) guarantee

It is challenging to achieve high utilization while guaranteeing QoS:
- Static/dynamic resource reservations may lead to low resource utilization
- Existing heuristic adjustments cannot provide theoretical guarantees such as a settling time or a stability range
Contribution
- Two-layer, control-theory-based SHARP (SHAred Resource Partitioning) architecture
- Proposes an empirical model
- Designs a customized application controller (Reinforced Oscillation Resistant, ROR)
- Studies two policies that can be used in SHARP:
  - SD (Service Differentiation)
  - FSI (Fair Speedup Improvement), where fair speedup is the harmonic mean of the per-application speedups:

    FS = N_app / sum_{i=1..N_app} (IPC_i^base / IPC_i^scheme)
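The fair speedup metric above can be computed in one line; the IPC numbers here are made up for illustration:

```python
# Fair speedup as the harmonic mean of per-application speedups:
#   FS = N / sum_i (IPC_base[i] / IPC_scheme[i])

def fair_speedup(ipc_scheme, ipc_base):
    n = len(ipc_scheme)
    return n / sum(b / s for s, b in zip(ipc_scheme, ipc_base))

ipc_base   = [1.0, 0.5, 2.0]
ipc_scheme = [1.2, 0.6, 2.4]     # every application sped up by 1.2x
print(round(fair_speedup(ipc_scheme, ipc_base), 3))   # 1.2
```

The harmonic mean penalizes schemes that speed up one application at the expense of another, which is why it is used as the fairness objective.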
System Design
Why not PID?
Disadvantages of a PID (Proportional-Integral-Derivative) controller:
- Painstaking to tune the parameters
- Hard to integrate into a hierarchical architecture
- Sensitive to model variation at run time (static parameters)
- Generic, not problem-specific
- Based on a linear model
Application Controller
Pre-Actuation Negotiator (PAN)
Maps an overly demanded cache partition to a feasible one.

Policies: SD (Service Differentiation) and FSI (Fair Speedup Improvement).

With spill = sum_{i=0..N} w_i - W, the demand in excess of the W available ways, each request is scaled down proportionally:

  w_i* = floor(w_i * (1 - spill / sum_{i=0..N} w_i))
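A sketch of this scaling (the variable names are ours, not the paper's). Since w_i * (1 - spill / sum(w)) = w_i * W / sum(w), the floor can be computed exactly with integer arithmetic:

```python
# Pre-Actuation Negotiator: when the controllers demand more cache ways
# than exist, shrink every request proportionally:
#   spill = sum(w) - W
#   w_i*  = floor(w_i * (1 - spill / sum(w))) = floor(w_i * W / sum(w))

def negotiate(w, W):
    total = sum(w)
    if total <= W:
        return list(w)                 # demand already feasible
    return [wi * W // total for wi in w]   # exact integer floor

demand = [6, 10, 4]      # 20 ways requested
W = 16                   # only 16 ways exist
feasible = negotiate(demand, W)
print(feasible, sum(feasible) <= W)   # [4, 8, 3] True
```

The floor guarantees feasibility at the cost of leaving a few ways unallocated; those are reclaimed when the set points are raised again.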
SHARP Controller
- Increases the IPC set points when cache ways are under-utilized, scaling every reference by a common factor so that the allocated ways sum to the total W:

    P_i^ref(t+1) = P_i^ref(t) * W / sum_{j=0..N} w_j*(t+1)

- FSI & SD policies
- Proof of guaranteed optimal utilization
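Assuming the allocated ways track the references (w_i = K_i * P_i^ref, as in the utilization proof), scaling all set points by a common factor fills the cache exactly; every coefficient and number below is made up for illustration:

```python
# Sketch of set-point scaling for full utilization: if the ways demanded
# are proportional to the IPC references, multiplying every reference by
# W / sum(w) makes the demanded ways sum to exactly W.

def scale_setpoints(p_ref, w, W):
    factor = W / sum(w)
    return [p * factor for p in p_ref]

K = [2.0, 4.0, 1.0]              # illustrative per-application coefficients
p_ref = [1.5, 1.0, 2.0]          # current IPC set points
w = [k * p for k, p in zip(K, p_ref)]    # ways demanded: sums to 9
W = 16                           # total ways: the cache is under-utilized

p_ref2 = scale_setpoints(p_ref, w, W)
w2 = [k * p for k, p in zip(K, p_ref2)]
print(round(sum(w2), 3))         # 16.0: all ways utilized
```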
Experimental Setup
- Simulator: Simics (full-system simulator)
- Operating system: Solaris 10
- Configurations: 2 and 8 cores
- Workload: 6 mixes of applications selected from SPEC2000
Evaluation (Application Controller)
Long-run results of the PID controller and the ROR controller
Evaluation (FSI)
SHARP vs. baselines
Evaluation (SD)
Adaptation of IPC with the SD policy using the ROR controllers.
Sensitivity & Scalability
Sensitivity analysis for different reference points
Scalability (8 cores)
Conclusion
SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors:
- Proposes and designs the SHARP control architecture for shared L2 caches
- Validates SHARP with different management policies (FSI and SD)
- Achieves the desired FS and SD specifications
Critiques (1)
How to decide the relative latency set point?
For the purpose of accelerating critical threads, parallel workloads may be more applicable.
Critiques (2)
No stability proof
Insufficient description of how the parameters of the application controllers are updated
Comparison

Relative latency control with the power constraint:
- Goal: guarantee NUCA L2 cache relative latency under different power budgets
- Design: two-layer hierarchical
- Controller: PID
- Coordination & stability analysis: yes
- Actuator: cache bank resizing and partitioning
- Evaluation: SimpleScalar

SHARP control architecture:
- Goal: improve normal L2 cache utilization while guaranteeing the QoS metrics
- Design: two-layer hierarchical
- Controller: ROR
- Coordination & stability analysis: no
- Actuator: cache way resizing and partitioning
- Evaluation: Simics
Q & A
Thank you
Backup Slides Start
Relative Controller Evaluation (2)
Application Controller Evaluation (2)
Guaranteed Optimal Utilization Proof
The K_i are time-varying coefficients that depend on the applications: the ways allocated to application i track its set point, w_i(t) = K_i(t) * P_i^ref(t). Scaling every set point by the common factor W / sum_j w_j(t) therefore yields sum_i w_i(t+1) = W, i.e., all cache ways are utilized.
System Design