cache sharing with trict o for latency critical...
TRANSCRIPT
UBIK: EFFICIENT CACHE SHARING WITH STRICT QOS
FOR LATENCY CRITICAL WORKLOADS
HARSHAD KASTURE, DANIEL SANCHEZ
ASPLOS 2014
Motivation 2
Low server utilization in datacenters is a major source of
inefficiency
L. Barroso and U. Hölzle, The Case for Energy-Proportional Computing
Common Industry Practice 3
Core 0
Last Level Cache
Core 1 Core 2
Core 3 Core 4 Core 5
Latency Critical Application
Dedicated machines for latency-critical applications
guarantees QoS
Core 0
Last Level Cache
Core 1 Core 2
Core 3 Core 4 Core 5
Common Industry Practice 4
Latency Critical Application
Dedicated machines for latency-critical applications guarantees QoS
Under utilization of machine resources
Core 0
Last Level Cache
Core 1 Core 2
Core 3 Core 4 Core 5
Colocation to Improve Utilization 5
Latency Critical Applications
Can utilize spare resources by colocating batch apps
Core 0
Last Level Cache
Core 1 Core 2
Core 3 Core 4 Core 5
Sharing Causes Interference! 6
Latency Critical Applications
Can utilize spare resources by colocating batch apps
Contention in shared resources degrades QoS
Outline 7
Introduction
Analysis of latency-critical apps
Inertia-oblivious cache management schemes
Ubik: Inertia-aware cache management
Evaluation
Understanding Latency-Critical Applications 8
Large number of backend servers participate in handling every user request
Total service time determined by tail latency behavior of backend
Client
Client
Client
Front End
Back End
Back End
Back End
Front End
Datacenter
Understanding Latency-Critical Applications 9
Service latency highly sensitive to changes in load
Understanding Latency-Critical Applications 10
Short bursts of activity interspersed with idle periods
Need guaranteed high performance during active periods
Active
Idle
Time
Inertia and Transient Behavior 11
Core 0 Core 1
Last Level Cache
Core 2
Core 3 Core 4 Core 5
Time
IPC
Inertia and Transient Behavior 12
Transient lengths can dominate tail latency!
Any dynamic reconfiguration scheme has to be inertia-aware
Many hardware resources exhibit inertia
branch predictors, prefetchers, memory bandwidth…
LLCs are one of the biggest sources of inertia
Core 0 Core 1
Last Level Cache
Core 2
Core 3 Core 4 Core 5
Time
IPC
Transient begin Transient end
Outline 13
Introduction
Analysis of latency-critical apps
Inertia-oblivious cache management schemes
Ubik: Inertia-aware cache management
Evaluation
Inertia-Oblivious Cache Management 14
Last Level Cache (LLC)
Core 2 Core 3
Core 0 Core 1
LC1 LC2
Batch1 Batch2
Active
Idle
Active
Idle
Active
Idle
Active
Idle Time
Unmanaged LLC (LRU Replacement) 15
Last Level Cache (LLC)
Core 2 Core 3
Core 0 Core 1
LC1 LC2
Batch1 Batch2
Active
Idle
Active
Idle
Active
Idle
Active
Idle Time
Time
LLC
Space
✖ Unconstrained interference
results in poor tail-latency
behavior
Utility Based Cache Partitioning (UCP) 16
Time
LLC
Space
Active
Idle
Active
Idle
Active
Idle
Active
Idle Time
Reconfigure
Last Level Cache (LLC)
Core 2 Core 3
Core 0 Core 1
LC1 LC2
Batch1 Batch2
✔ High batch throughput
✖ Poor tail latency (low allocation)
OnOff: Efficient but Unsafe 17
Active
Idle
Active
Idle
Active
Idle
Active
Idle Time
Batch Reconfigure Time
LLC
Space
Last Level Cache (LLC)
Core 2 Core 3
Core 0 Core 1
LC1 LC2
Batch1 Batch2
✔ High batch throughput
Cross-Request LLC Inertia 18
Other applications qualitatively similar (see paper for details)
Shore-MT, 2 MB LLC
Cross-request hits
LLC
Acc
ess
Bre
akd
ow
n (%
) Misses
Hits (same request)
StaticLC: Safe but Inefficient 19
Batch Reconfigure Time
LLC
Space
Active
Idle
Active
Idle
Active
Idle
Active
Idle Time
Last Level Cache (LLC)
Core 2 Core 3
Core 0 Core 1
LC1 LC2
Batch1 Batch2
✔ Low tail latency (preserve LLC state)
✖ Low batch throughput (poor space
utilization)
Outline 20
Introduction
Analysis of latency-critical apps
Inertia-oblivious cache management schemes
Ubik: Inertia-aware cache management
Evaluation
Ubik: Performance Guarantee 21
Performance as well as overall progress under Ubik after
the deadline is identical to static partitioning
Instructions
Time
deadline
Progress
with constant
size
Progress
with Ubik
Request begins
Ubik: Overview 22
Activity
Time
Size
Time
nominal static
size
idle
size
Target Size
Actual Size
Ubik: Overview 23
Activity
Time
Size
Time
Target Size
Actual Size nominal static
size
idle
size
boosted
size
Ubik: Overview 24
Activity
Time
Size
Time
Target Size
Actual Size nominal static
size
idle
size
boosted
size
Ubik: Overview 25
Constraint: Cycles lost during should be compensated
for by the cycles gained during before the deadline
Activit
y
Time Size
Time
Target Size
Actual Size nominal static size
idle size
boosted size
Analyzing Transients 26
s2
Size
Time
s1
Target
Size Actual
Size
Ttransient
Need accurate predictions for
The length of the transient from s1 to s2
Cycles lost during the transient from s1 to s2
Instructions
Time
Progress with
constant size (s2)
Progress
with Ubik
Transient
begins Transient ends
Lost
Performanc
e
Hardware Support 27
Utility monitors to measure per-application miss curves
Fine grained cache partitioning
Memory Level Parallelism (MLP) profiler
pS2
Size s1
pS1
s2
Miss probability
Bounds on Transient Behavior 28
s2
Size
Time
s1
Target
Size Actual
Size
Ttransient
Instructions
Time
Progress with
constant size (s2)
Progress
with Ubik
Transient
begins Transient ends
Lost
Performance (L)
Ttransient
c
ps
M s s
1
s21 s
2 s
1 c
ps2
M
L M 1 ps2
ps
Ms s
1
s21 s
2 s
1 1 p
s2
ps1
Ubik: Partition Sizing 29
Use transient analysis to identify feasible (idle size,
boosted size) pairs
Size
Time
1
deadline
Ubik: Partition Sizing 30
Use transient analysis to identify feasible (idle size,
boosted size) pairs
Size
Time
1
2
deadline
Ubik: Partition Sizing 31
Use transient analysis to identify feasible (idle size,
boosted size) pairs
Size
Time
1
2
3
deadline
Ubik: Partition Sizing 32
Use transient analysis to identify feasible (idle size,
boosted size) pairs
I N F E A S I B L E Size
Time
1
2
3
4
deadline
Ubik: Partition Sizing 33
Use transient analysis to identify feasible (idle size,
boosted size) pairs
Choose the pair that yields the maximum batch throughput
Size
Time deadline
1
2
3
See paper for details
Outline 34
Introduction
Analysis of latency-critical apps
Inertia-oblivious cache management schemes
Ubik: Inertia-aware cache management
Evaluation
Workloads 35
Five diverse latency-critical apps
xapian (search engine)
masstree (in-memory key-value store)
moses (statistical machine translation)
shore-mt (multi-threaded DBMS)
specjbb (java middleware)
Batch applications: random mixes of SPECCPU 2006
benchmarks
Target System 36
6 OOO cores
Private L1I, L1D and L2
caches
12MB shared LLC
400 6-app mixes: 3
latency-critical + 3 batch
apps
Apps pinned to cores
Core 0
L3
Bank 0
Core 1
L3
Bank 1
Core 2
L3
Bank 2
Core 3
L3
Bank 3
Core 4
L3
Bank 4
Core 5
L3
Bank 5
Batch1 Batch2 Batch3
LC1 LC2 LC3
Metrics 37
Baseline system has
private LLCs
We report
Normalized tail latency
Throughput improvement
for batch applications Core 0
L3 0
Core 1
L3 1
Core 2
L3 2
Core 3
L3 3
Core 4
L3 4
Core 5
L3 5
Batch1 Batch2 Batch3
LC1 LC2 LC3
Results: Unmanaged LLC (LRU) 38
Higher is better
Results: UCP 39
Higher is better
Results: OnOff 40
Higher is better
Results: StaticLC 41
Higher is better
Results: Ubik 42
Higher is better
Results: Summary 43
Higher is better
OnOff
UCP
Ubik
LRU
StaticLC Private LLC
Conclusions 44
To guarantee tail latency, dynamic resource management
schemes must be inertia-aware
Ubik: Inertia-aware cache capacity management
Preserves tail of latency-critical apps
Achieves high cache space utilization for batch apps
Requires minimal additional hardware
THANKS FOR YOUR ATTENTION!
QUESTIONS?