Page 1:

Characterizing and Optimizing the Performance of Multithreaded Programs Under Interference

Yong Zhao, Jia Rao (The University of Texas at Arlington)
Qing Yi (University of Colorado Colorado Springs)

PACT 2016, Haifa, Israel

Page 2:

Cloud Computing

Cloud computing, powered by warehouse-scale data centers, provides users with abundant parallelism and potentially unlimited scalability

• Characteristics

- Virtualization

- Scalability and elasticity

- Economy of scale

- Multi-tenancy and workload consolidation

Page 3:

Model        vCPU  Mem (GB)
t2.nano      1     0.5
t2.micro     1     1
t2.small     2     2
t2.medium    2     4
t2.large     2     8

Ideal for Running Parallel Programs

General Purpose
Model        vCPU  Mem (GB)
m4.large     2     8
m4.xlarge    4     16
m4.2xlarge   8     32
m4.4xlarge   16    64
m4.10xlarge  40    160
(Use Cases: small databases, data processing tasks, cluster computing etc.)

Compute Optimized
Model        vCPU  Mem (GB)
c4.large     2     3.75
c4.xlarge    4     7.5
c4.2xlarge   8     15
c4.4xlarge   16    30
c4.8xlarge   36    60
(Use Cases: HPC applications, batch processing, distributed analytics etc.)

Memory Optimized
Model        vCPU  Mem (GB)
x1.32xlarge  128   1952
(Use Cases: HPC applications, Apache Spark, Presto etc.)

From https://aws.amazon.com/ec2/instance-types/

Page 4:

Parallel Performance in the Cloud

Suboptimal and unpredictable performance due to multi-tenant interference

[Bar chart: normalized slowdown (0 to 5) of streamcluster, canneal, facesim, fluidanimate, swaptions, dedup, raytrace, lu, and sp when running alone, with streamcluster, and with fluidanimate; slowdowns reach 1.3x, 2.8x, and up to 4.3x.]

[Diagram: the parallel program's VM1 vCPU1-4 and an interfering VM2's vCPU1-4 are multiplexed onto pCPU1-4.]

Shared resources (e.g., LLC, memory) -> slowdown on individual threads (memory footprint, access pattern ...)
CPU multiplexing -> overall slowdown (fair-sharing algorithm ...)
Sync methods, load balancing -> ???

Page 5:

Related Work

• Reducing sync delays
- BWS [USENIX ATC'14]
- Demand-based coordinated scheduling [ASPLOS'13]
- Balance scheduling [EuroSys'13]
- Adaptive scheduling [HPDC'11]
- Lock-aware scheduling [VM'04]
- Relaxed co-scheduling [VMware]

• Modeling shared resource contention
- Numerous excellent works: Qureshi et al. [MICRO'06], Suh et al. [HPCA'02], Tam et al. [ASPLOS'09], Chandra et al. [HPCA'05], Guo et al. [SIGMETRICS'06], Iyer et al. [SIGMETRICS'07], Moscibroda et al. [USENIX Security'07], Mutlu et al. [MICRO'07, ISCA'08], Blagodurov et al. [TOCS'10], Xiang et al. [ASPLOS'13]

Page 6:

Our Focus

Parallel programs are complex

- Parallel models

- Synchronization methods

- User-level work assignment

- Inter-thread data sharing

Why is parallel performance so difficult to reason about under interference?

Interference is dynamic

- Contention on the memory hierarchy

- Contention on CPU cycles

- Interplays between the two

We focus on studying how parallel programs respond to interference

Page 7:

Synthetic Interference

• The effects of interference
- Slows down individual parallel threads
- Causes asynchrony among threads

• Synthetic interference abstracts these effects
- Slowing down threads -> reducing CPU allocations
- Asynchrony -> stopping threads at different times

• Synthetic interference: zero memory footprint, fully controllable
• Persistent: a simple while(1) CPU hog
• Periodic: demands CPU at regular intervals and otherwise stays idle (10 ms busy, 10 ms idle)
• Intermittent: demands CPU at irregular intervals (busy periods randomly chosen from 1 ms to 40 ms; each idle period matches the preceding busy period)
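A minimal user-level sketch of the three interference generators described above; this illustrates the described behavior only, and the timer-based busy/idle loop is an assumption rather than the authors' tool.

```c
/* Synthetic CPU hogs: persistent, periodic (10 ms busy / 10 ms idle), and
 * intermittent (random 1-40 ms busy periods with matching idle periods).
 * Build with, e.g., -DPERIODIC or -DPERSISTENT to select a variant. */
#include <stdlib.h>
#include <time.h>

static void spin_ms(long ms) {              /* busy-wait for ms milliseconds */
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000L +
             (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

static void sleep_ms(long ms) {             /* stay idle for ms milliseconds */
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

int main(void) {
    for (;;) {
#if defined(PERSISTENT)
        ;                                   /* never yields: a while(1) CPU hog */
#elif defined(PERIODIC)
        spin_ms(10);                        /* 10 ms busy */
        sleep_ms(10);                       /* 10 ms idle */
#else  /* intermittent */
        long busy = 1 + rand() % 40;        /* busy period in [1 ms, 40 ms] */
        spin_ms(busy);
        sleep_ms(busy);                     /* idle period matches the busy period */
#endif
    }
    return 0;
}
```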

Page 8:

Profiling Parallel Performance

• Decomposing parallel runtime

• compute - computation + memory access time

• sync - blocking or spinning time

• steal - time during which the multi-tenant system serves other users

• Recording the performance events

Event      Description
YIELD      Voluntary yield to other vCPUs due to idling
PREEMPT    Involuntary preemption by the hypervisor
IDLE       Time the pCPU spends in the idle state
HARDWARE   Statistics from hardware performance counters (MPKI, L2 cache misses, L3 references, etc.)

Parallel runtime = compute + sync + steal
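A minimal sketch of this decomposition; the type and field names are illustrative, not the authors' profiler.

```c
/* Per-sample runtime breakdown; names are assumptions for illustration. */
typedef struct {
    double compute;   /* computation + memory access time */
    double sync;      /* blocking or spinning time */
    double steal;     /* time the system spends serving other tenants */
} breakdown_t;

/* Parallel runtime = compute + sync + steal */
static double parallel_runtime(const breakdown_t *b) {
    return b->compute + b->sync + b->steal;
}
```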

Page 9:

Methodology

Differential analysis

- Compare the parallel runtime breakdown under interference with that in an interference-free environment to identify the causes of performance degradation

[Timeline: execution alternates between normal mode and a dedicated execution mode in which co-running workloads are temporarily throttled for one sample period and then resumed.]
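A hedged sketch of this differential-analysis loop; the helpers below are illustrative stubs and the 100 ms sampling period is an assumed value, not the authors' implementation.

```c
/* Differential analysis: take an interference-free reference sample by
 * briefly throttling co-runners, then a normal sample, and compare the
 * two runtime breakdowns. All helpers are stubs for illustration. */
#include <stdio.h>

typedef struct { double compute, sync, steal; } breakdown_t;

static void throttle_corunners(void) { /* pause co-running workloads (stub) */ }
static void resume_corunners(void)   { /* resume co-running workloads (stub) */ }

static breakdown_t sample_breakdown(long period_ms) {
    (void)period_ms;                     /* would read events and counters here */
    breakdown_t b = {0.0, 0.0, 0.0};
    return b;
}

int main(void) {
    const long sample_period_ms = 100;   /* assumed sampling period */

    throttle_corunners();                /* dedicated execution mode */
    breakdown_t reference = sample_breakdown(sample_period_ms);
    resume_corunners();                  /* back to normal mode */

    breakdown_t contended = sample_breakdown(sample_period_ms);

    /* Differences in compute/sync/steal point to the causes of degradation. */
    printf("delta: compute=%.2f sync=%.2f steal=%.2f\n",
           contended.compute - reference.compute,
           contended.sync    - reference.sync,
           contended.steal   - reference.steal);
    return 0;
}
```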

Page 10:

Varying CPU Allocation

steal / compute = 1: fair CPU allocation between workloads
steal / compute < 1: more resilient to interference

[Bar chart: steal time relative to compute time (0 to 1) for streamcluster, canneal, facesim, fluidanimate, swaptions, dedup, raytrace, lu, and sp.]

✓ Common in popular hypervisors: Xen and KVM
✓ Due to vCPU preemption and prioritization
✓ Affects parallel programs with blocking synchronization

- Steal time should be considered to understand parallel performance in shared systems
- Steal time can be managed by controlling preemption

Foreground: four-thread parallel programs. Background: single-thread persistent interference.

Page 11:

Compute Time Changes under Interference

Compute = computation ((almost) invariant) + data access time (dynamic)

[Stacked bar chart: normalized parallel runtime breakdown (compute, sync, steal) for streamcluster, sp, bt, and lu, with and without interference.]

Compute time dropped by as much as 50%!

Foreground: four-thread parallel programs. Background: single-thread persistent interference.

Page 12:

Reduced Data Access Time

[Charts, relative to no interference: (a) OFFCORE_STALL, (b) MPKI, (c) L3 references, and (d) L2 cache misses (capacity/conflict vs. coherence) for streamcluster, sp, bt, and lu, with and without persistent interference. Hardware statistics are the sum over all threads.]

Lessons learned
✓ Memory access time increases due to inter-program contention on shared resources.
✓ Memory cost can drop due to alleviated intra-program contention, e.g., fewer coherence misses.

Page 13:

Varying Memory Cost Under Interference

[Bar chart: changes to offcore stalls (%), ranging from -50% to +100%, for cg, ep, ua, mg, ft, is, canneal, fluidanimate, swaptions, ferret, blackscholes, raytrace, x264, and dedup.]

Increase in memory cost: loss of locality
Reduction in memory cost: mitigation of intra-program contention

Case study: better sp performance with fewer resources
            No inter.   w/ inter.
CPU%        400%        363% (-9.2%)
Runtime     1004 s      914 s (9.0% faster)

Lessons learned
✓ Memory access cost changes under interference in most parallel applications
✓ Avoid inter-program contention and exploit the mitigation of intra-program contention in workload consolidation
✓ Compute time must be closely monitored to understand parallel performance

Page 14:

Complex Interactions with the Scheduler (1/2)

[Stacked bar charts: normalized runtime breakdown (compute, sync, steal) for (a) streamcluster and (b) canneal under no, persistent, periodic, and intermittent interference, plus normalized steal, sync, idle, yield, and preempt values under periodic and intermittent interference.]

Different applications exhibit different degrees of degradation under the same type of interference.

Lessons learned:
✓ Reducing the number of preemptions would help improve performance under interference.
✓ System idle time is a good indicator of scheduling efficiency.

Foreground: four-thread parallel programs with blocking sync. Background: four-thread interference.

Page 15:

Complex Interactions with the Scheduler (2/2)

[Stacked bar charts: normalized runtime breakdown (compute, sync, steal) for (c) sp and (d) lu under no, persistent, periodic, and intermittent interference.]

Lessons learned:
✓ Fine-grained scheduling helps stop spinning vCPUs in a timely manner so that the overall sync time is reduced.
✓ Out-of-sync execution due to intermittent interference helps reduce memory access time.

Foreground: four-thread parallel programs with spinning sync. Background: four-thread interference.

Page 16:

Online Performance Prediction

• Objective: predict the overall slowdown before the program completes.
• Method: sample parallel execution under contention and compare it to a reference profile.
• Design: compare the amount of useful work done in the two samples, assuming an ideal memory system with zero latency and perfect load balancing.

✓ t_total is the sampling period, and t_steal can be measured directly
✓ t_mem can be approximated by OFFCORE_STALL on Intel processors
✓ t_sync due to spinning is accounted for using our BPI-based spin detection [PPoPP'14]
✓ t_sync due to blocking is the sum of blocked time and context-switch costs

t_ideal = t_total - t_steal - t_sync - t_mem

slowdown = t'_ideal / t_ideal
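A hedged sketch of this prediction, assuming the primed term refers to the interference-free reference sample and that both samples cover the same sampling period; the names are illustrative, not the authors' code.

```c
/* Useful work per sample under an ideal memory system with perfect load
 * balancing, and the resulting slowdown prediction. */
typedef struct {
    double total;   /* sampling period */
    double steal;   /* directly measured steal time */
    double sync;    /* spin time (BPI-based detection) + blocked time + context switches */
    double mem;     /* approximated by OFFCORE_STALL cycles */
} sample_t;

static double ideal_time(const sample_t *s) {
    /* t_ideal = t_total - t_steal - t_sync - t_mem */
    return s->total - s->steal - s->sync - s->mem;
}

/* slowdown = t'_ideal / t_ideal: reference useful work over contended useful
 * work, assumed to be greater than 1 under interference. */
static double predict_slowdown(const sample_t *reference, const sample_t *contended) {
    return ideal_time(reference) / ideal_time(contended);
}
```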

Page 17:

Prediction Accuracy

[Charts: prediction error (%), based on useful work, MPKI, and CPI, for sp, bt, lu, cg, ep, mg, ua, streamcluster, canneal, fluidanimate, blackscholes, swaptions, and raytrace under 1- and 4-thread loop and streamcluster interference. Closer to zero is better.]

Prediction is based on the sum of metrics over all threads.

4.5% mean absolute percentage error (MAPE); not accurate for programs with frequent phase changes.
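For reference, MAPE here is the standard mean absolute percentage error; a small sketch (the array arguments are illustrative):

```c
#include <math.h>

/* Mean absolute percentage error over n (predicted, actual) slowdown pairs. */
double mape(const double *predicted, const double *actual, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += fabs((predicted[i] - actual[i]) / actual[i]);
    return 100.0 * sum / n;   /* as a percentage */
}
```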

Page 18:

Performance Optimizations

Insight-1: uncontrolled preemption is a major source of unpredictability and inefficiency

Insight-2: memory-bound programs benefit from out-of-sync execution

Delayed Preemption (DP): minimize premature/involuntary preemptions

Differential Scheduling (DS): intentionally create out-of-sync execution with differential time slices

Page 19:

Delayed Preemption

• Objective: maximize execution efficiency by minimizing idle time
• Method: temporarily delay a waking vCPU in the hope that the currently running vCPU will voluntarily yield the CPU
• Design: selection of the preemption delay
- static delay: 2 ms or 8 ms
- adaptive delay: adjust the preemption delay according to system idle time (see the sketch below)
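A hedged sketch of the adaptive delay. The slide states only that the delay is adjusted according to system idle time; the control direction and constants below are assumptions for illustration, not the authors' Xen implementation.

```c
/* One plausible adaptive-delay policy (an assumption, not the paper's rule):
 * shrink the preemption delay when system idle time grows, on the premise
 * that overly long delays leave waiting vCPUs blocked and pCPUs idle, and
 * grow it when the system stays busy so fewer preemptions occur. */
#define DP_MIN_US       0
#define DP_MAX_US       8000        /* upper bound, matching the 8 ms static option */
#define DP_STEP_US      500
#define IDLE_THRESHOLD  0.05        /* assumed: tolerate up to 5% idle time */

static long dp_delay_us = 2000;     /* start from the 2 ms static option */

/* Called once per accounting period with the measured fraction of idle pCPU time. */
void dp_adapt(double idle_fraction) {
    if (idle_fraction > IDLE_THRESHOLD) {
        dp_delay_us -= DP_STEP_US;                      /* back off: idling is rising */
        if (dp_delay_us < DP_MIN_US) dp_delay_us = DP_MIN_US;
    } else {
        dp_delay_us += DP_STEP_US;                      /* system is busy: delay longer */
        if (dp_delay_us > DP_MAX_US) dp_delay_us = DP_MAX_US;
    }
}

/* On vCPU wakeup, instead of preempting immediately, arm a timer for
 * dp_delay_us and preempt only if the running vCPU has not yielded by then. */
```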

Page 20:

Delayed Preemption - Results

[Bar charts: normalized runtime (lower is better) of streamcluster, facesim, canneal, raytrace, fluidanimate, vips, dedup, ferret, swaptions, blackscholes, bodytrack, and x264 under Xen, DP(2ms), DP(8ms), and Adaptive-DP, when (a) co-running with streamcluster and (b) co-running with fluidanimate.]

Adaptive-DP outperforms Xen by 12% in overall performance, and runtime variation is minimized.

Page 21:

Differential Scheduling

• Objective: mitigate intra-program contention on the memory hierarchy and reduce wasteful spin time
• Method: create out-of-sync execution on different CPUs by assigning them different time slices
• Design: randomly select time slices from [10 ms, 30 ms] while ensuring that the mean time slice is the same on every CPU (see the sketch below)
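A minimal sketch of the time-slice selection just described; the uniform draw is an assumption for illustration, not the Xen implementation.

```c
#include <stdlib.h>

#define SLICE_MIN_MS 10
#define SLICE_MAX_MS 30

/* Draw the next time slice uniformly from [10 ms, 30 ms]: slices differ
 * across CPUs (creating out-of-sync execution) while the expected slice
 * length stays the same 20 ms on every CPU. */
long next_time_slice_ms(void) {
    return SLICE_MIN_MS + rand() % (SLICE_MAX_MS - SLICE_MIN_MS + 1);
}
```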

Page 22:

Differential Scheduling - Results

[Bar charts: normalized runtime (lower is better) of sp, lu, cg, ua, ft, and bt under Xen and DS when co-running with (a) persistent interference, (b) periodic interference, and (c) sp.]

On average, DS outperforms Xen by 32% and significantly reduces runtime variation.

Page 23:

Conclusions

• Objective: uncover the causes of performance degradation and unpredictability of multi-threaded parallel programs in consolidated systems

• Method: synthetic interference + parallel runtime breakdown + differential analysis

• Findings:

- Existing CPU scheduling algorithms fail to predictably and fairly allocate CPU time to programs with various demand patterns

- The behavior of parallel programs as a whole changes under interference, e.g., they place varying pressure on the memory hierarchy and their overall CPU demand changes

• Results:

- Identify the invariant in parallel runtime to perform online prediction

- Propose two scheduling optimizations: Delayed Preemption and Differential Scheduling

Page 24:

Acknowledgement

Questions?