cache sharing with trict o for latency critical...

UBIK: EFFICIENT CACHE SHARING WITH STRICT QOS

FOR LATENCY CRITICAL WORKLOADS

HARSHAD KASTURE, DANIEL SANCHEZ

ASPLOS 2014

Motivation 2

Low server utilization in datacenters is a major source of

inefficiency

L. Barroso and U. Hölzle, The Case for Energy-Proportional Computing

Common Industry Practice 3

Core 0

Last Level Cache

Core 1 Core 2

Core 3 Core 4 Core 5

Latency Critical Application

Dedicated machines for latency-critical applications

guarantees QoS

Core 0

Last Level Cache

Core 1 Core 2


Common Industry Practice 4

Latency Critical Application

Dedicated machines for latency-critical applications guarantees QoS

Under utilization of machine resources

Core 0

Last Level Cache

Core 1 Core 2


Colocation to Improve Utilization 5

Latency Critical Applications

Can utilize spare resources by colocating batch apps

Core 0

Last Level Cache

Core 1 Core 2


Sharing Causes Interference! 6

Latency Critical Applications

Can utilize spare resources by colocating batch apps

Contention in shared resources degrades QoS

Outline 7

Introduction

Analysis of latency-critical apps

Inertia-oblivious cache management schemes

Ubik: Inertia-aware cache management

Evaluation

Understanding Latency-Critical Applications 8

Large number of backend servers participate in handling every user request

Total service time determined by tail latency behavior of backend

Client

Client

Client

Front End

Back End

Back End

Back End

Front End

Datacenter


Service latency highly sensitive to changes in load


Short bursts of activity interspersed with idle periods

Need guaranteed high performance during active periods

Active

Idle

Time

Inertia and Transient Behavior 11

Core 0 Core 1

Last Level Cache

Core 2


Time

IPC

Inertia and Transient Behavior 12

Transient lengths can dominate tail latency!

Any dynamic reconfiguration scheme has to be inertia-aware

Many hardware resources exhibit inertia

branch predictors, prefetchers, memory bandwidth…

LLCs are one of the biggest sources of inertia

Core 0 Core 1

Last Level Cache

Core 2


Time

IPC

Transient begin Transient end

Outline 13

Introduction




Evaluation

Inertia-Oblivious Cache Management 14

Last Level Cache (LLC)

Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

Active

Idle

Active

Idle

Active

Idle

Active

Idle Time

Unmanaged LLC (LRU Replacement) 15


Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

Active

Idle

Active

Idle

Active

Idle

Active

Idle Time

Time

LLC

Space

✖ Unconstrained interference

results in poor tail-latency

behavior

Utility Based Cache Partitioning (UCP) 16

Time

LLC

Space

Active

Idle

Active

Idle

Active

Idle

Active

Idle Time

Reconfigure


Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

✔ High batch throughput

✖ Poor tail latency (low allocation)

OnOff: Efficient but Unsafe 17

Active

Idle

Active

Idle

Active

Idle

Active

Idle Time

Batch Reconfigure Time

LLC

Space


Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

✔ High batch throughput

Cross-Request LLC Inertia 18

Other applications qualitatively similar (see paper for details)

Shore-MT, 2 MB LLC

Cross-request hits

LLC

Acc

ess

Bre

akd

ow

n (%

) Misses

Hits (same request)

StaticLC: Safe but Inefficient 19

Batch Reconfigure Time

LLC

Space

Active

Idle

Active

Idle

Active

Idle

Active

Idle Time


Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

✔ Low tail latency (preserve LLC state)

✖ Low batch throughput (poor space

utilization)

Outline 20

Introduction




Evaluation

Ubik: Performance Guarantee 21

Performance as well as overall progress under Ubik after

the deadline is identical to static partitioning

Instructions

Time

deadline

Progress

with constant

size

Progress

with Ubik

Request begins

Ubik: Overview 22

Activity

Time

Size

Time

nominal static

size

idle

size

Target Size

Actual Size

Ubik: Overview 23

Activity

Time

Size

Time

Target Size

Actual Size nominal static

size

idle

size

boosted

size

Ubik: Overview 24

Activity

Time

Size

Time

Target Size

Actual Size nominal static

size

idle

size

boosted

size

Ubik: Overview 25

Constraint: Cycles lost during should be compensated

for by the cycles gained during before the deadline

Activit

y

Time Size

Time

Target Size

Actual Size nominal static size

idle size

boosted size

Analyzing Transients 26

s2

Size

Time

s1

Target

Size Actual

Size

Ttransient

Need accurate predictions for

The length of the transient from s1 to s2

Cycles lost during the transient from s1 to s2

Instructions

Time

Progress with

constant size (s2)

Progress

with Ubik

Transient

begins Transient ends

Lost

Performanc

e

Hardware Support 27

Utility monitors to measure per-application miss curves

Fine grained cache partitioning

Memory Level Parallelism (MLP) profiler

pS2

Size s1

pS1

s2

Miss probability

Bounds on Transient Behavior 28

s2

Size

Time

s1

Target

Size Actual

Size

Ttransient

Instructions

Time

Progress with

constant size (s2)

Progress

with Ubik

Transient

begins Transient ends

Lost

Performance (L)

Ttransient

c

ps

M s s

1

s21 s

2 s

1 c

ps2

M

L M 1 ps2

ps

Ms s

1

s21 s

2 s

1 1 p

s2

ps1

Ubik: Partition Sizing 29

Use transient analysis to identify feasible (idle size,

boosted size) pairs

Size

Time

1

deadline



boosted size) pairs

Size

Time

1

2

deadline



boosted size) pairs

Size

Time

1

2

3

deadline



boosted size) pairs

I N F E A S I B L E Size

Time

1

2

3

4

deadline



boosted size) pairs

Choose the pair that yields the maximum batch throughput

Size

Time deadline

1

2

3

See paper for details

Outline 34

Introduction




Evaluation

Workloads 35

Five diverse latency-critical apps

xapian (search engine)

masstree (in-memory key-value store)

moses (statistical machine translation)

shore-mt (multi-threaded DBMS)

specjbb (java middleware)

Batch applications: random mixes of SPECCPU 2006

benchmarks

Target System 36

6 OOO cores

Private L1I, L1D and L2

caches

12MB shared LLC

400 6-app mixes: 3

latency-critical + 3 batch

apps

Apps pinned to cores

Core 0

L3

Bank 0

Core 1

L3

Bank 1

Core 2

L3

Bank 2

Core 3

L3

Bank 3

Core 4

L3

Bank 4

Core 5

L3

Bank 5

Batch1 Batch2 Batch3

LC1 LC2 LC3

Metrics 37

Baseline system has

private LLCs

We report

Normalized tail latency

Throughput improvement

for batch applications Core 0

L3 0

Core 1

L3 1

Core 2

L3 2

Core 3

L3 3

Core 4

L3 4

Core 5

L3 5

Batch1 Batch2 Batch3

LC1 LC2 LC3

Results: Unmanaged LLC (LRU) 38

Higher is better

Results: UCP 39

Higher is better

Results: OnOff 40

Higher is better

Results: StaticLC 41

Higher is better

Results: Ubik 42

Higher is better

Results: Summary 43

Higher is better

OnOff

UCP

Ubik

LRU

StaticLC Private LLC

Conclusions 44

To guarantee tail latency, dynamic resource management

schemes must be inertia-aware

Ubik: Inertia-aware cache capacity management

Preserves tail of latency-critical apps

Achieves high cache space utilization for batch apps

Requires minimal additional hardware

THANKS FOR YOUR ATTENTION!

QUESTIONS?

cache sharing with trict o for latency critical...

Documents