adaptive smt control for more responsive web applications · ibm research - tokyo adaptive smt...

22
IBM Research - Tokyo Oct 27, 2014 IISWC @ Raleigh, NC, USA © 2014 IBM Corporation Adaptive SMT Control for More Responsive Web Applications Hiroshi Inoue †‡ and Toshio Nakatani IBM Research Tokyo University of Tokyo

Upload: trinhkiet

Post on 15-Feb-2019

218 views

Category:

Documents


0 download

TRANSCRIPT

IBM Research - Tokyo

Oct 27, 2014 IISWC @ Raleigh, NC, USA © 2014 IBM Corporation

Adaptive SMT Control for More Responsive Web Applications

Hiroshi Inoue†‡ and Toshio Nakatani† † IBM Research – Tokyo ‡ University of Tokyo

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Response time matters!

Peak throughput has been the common metric for the

Web server performance

Even sub-second improvements in response times are

essential for better user experiences†

– Amazon: +100 msec 1% drop in sales

– Yahoo: +400 msec 5-9% drop in traffic

– Google: +500 msec 20% drop in searches

We focus on improving the response time of Web

application servers

2

† Nicole Sullivan. Design Fast Websites. Oct 14, 2008

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Key Question: How SMT affects response time?

SMT (Simultaneous Multi Threading, a.k.a. Hyper

Threading) allows multiple hardware threads to run on

one core

SMT typically

improves aggregated throughput

degrades single-thread performance

Question: How SMT affects response times of

Web application server?

3

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Outline

1. How SMT affects response time

2. Adaptive SMT control with queuing model

4

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Evaluations

Processors:

– Xeon (SandyBridge-EP): 2-way SMT, 2.9 GHz, 16 cores

– POWER7: 4-way SMT, 3.55 GHz, 16 cores

Workloads:

– PHP (MediaWiki)

– Ruby (Ruby-on-rails)

– Java (Cognos BI)

OS: Redhat Enterprise Linux 6.4 (Kernel-2.6.32)

5

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

200

250

300

350

400

450

500

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

Response time of the PHP application on 16 cores of Xeon

6

higher CPU utilization lower CPU utilization

SMT improved response time

at high CPU utilization

SMT degraded response time

when CPUs were not fully loaded

SMT1 was better SMT2 was better

low

er

is fa

ste

r

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

250

300

350

400

450

500

550

600

650

700

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

SMT4 (4-way SMT)

Response time of the PHP application on 16 cores of POWER7

7

low

er

is fa

ste

r

SMT degraded

response time

when CPUs were

not fully loaded

SMT1 was best SMT2 was best SMT4 was best

higher CPU utilization lower CPU utilization

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

200

250

300

350

400

450

500

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

Response time of the PHP application on 1 core of Xeon

8

SMT improved

response time

even when CPUs were

not fully loaded

SMT2 was better

low

er

is fa

ste

r

higher CPU utilization lower CPU utilization

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

How SMT affects response time?

SMT hurts the response time on multicore systems with

low CPU utilization level, which is the common case in

today’s server

The crossover point depends on the number of cores

9

Low CPU utilization High CPU utilization

on 1 core improve improve

on multiple cores degrade improve

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Histogram of response time at low (~25%) CPU utilization

SMT degraded

single-thread

performance and

shifted the peak

of the histogram

towards slower

response times

SMT reduced

long-latency

transactions on 1

core

10

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14

200

220

240

260

280

300

320

340

360

380

400

420

440

460

480

histogram

resp

on

se

tim

e [

mse

c]

SMT2 (2-way SMT)

SMT1 (disabled SMT)

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14

200

220

240

260

280

300

320

340

360

380

400

420

440

460

480

histogram

respo

nse ti

me

[m

sec]

SMT2 (2-way SMT)

SMT1 (disabled SMT)

on 1 core of Xeon

low

er

me

an

s

faste

r re

spon

se

on 16 cores of Xeon

many long-latency transactions

on 1 core

SMT reduced the long-latency

transactions

almost no long-latency

transactions on 16 cores

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Breaking down response time

response time Tr = service time Ts + waiting time Tw SMT typically

increases service time (CPU time) by lowering single-thread

performance

reduces waiting time (in task scheduling queue) by providing

more hardware threads

SMT degrades the response time on multicore systems with low

CPU utilization level because waiting time is not significant in such

case

For other cases (single core or high utilization) waiting time affect

the total response time

11

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Outline

1. How SMT affects response time

2. Adaptive SMT control with queuing model

12

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Adaptive SMT Control

We periodically (once per 5 sec)

– obtain the CPU utilization from /proc/stat,

– calculate the response time for each SMT level

using a new queuing model, and

– select the best SMT level

Implemented as a user-space daemon without

modification in OS kernel

13

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Challenges in queuing model for SMT processors

How to model single-thread performance on SMT

processor

– affected by resource contention among the SMT threads

How to model task migration behavior of the OS task

scheduler

– aggressively balances the load among the SMT threads

within one core while minimizing migrations among different

cores

14

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Hierarchical queuing model

1.In-core modeling: model the single SMT core

– To calculate service time (i.e. single-thread

performance) and waiting time without considering task

migration

2.Out-of-core modeling: model the task migration among

cores

– To modify the waiting time considering the task

migration

Both phases are based on the standard M/M/s model

Model takes CPU utilization as input w/o task characteristics

See the paper for the model details

15

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

0.0

0.5

1.0

1.5

2.0

2.5

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2

norm

aliz

ed a

vera

ge r

esponse ti

me

normalized CPU utilization (1.0 means fully utilized without SMT)

response time (SMT1) response time (SMT2)

Response time predicted by our model on 16-cores of Xeon

16

SMT1 was

better

SMT2 was

better

measured response time

of MediaWiki (PHP)

predicted response time

of MediaWiki (PHP)

200

250

300

350

400

450

500

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

higher CPU utilizationlower CPU utilization

SMT1 was better SMT2 was better

low

er is

fa

ste

r

higher CPU utilization lower CPU utilization

low

er

is fa

ste

r

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

0.0

0.5

1.0

1.5

2.0

2.5

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8

norm

aliz

ed a

vera

ge r

esponse ti

me

normalized CPU utilization (1.0 means fully utilized without SMT)

response time (SMT1)

response time (SMT2)

response time (SMT4)

Response time predicted by our model on 16-cores of POWER7

17

SMT1 was

best

SMT4 was

best

SMT2 was

best

250

300

350

400

450

500

550

600

650

700

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

SMT4 (4-way SMT)

low

er is

fa

ste

r

SMT1 was best SMT2 was best SMT4 was best

higher CPU utilizationlower CPU utilization

measured response time

of MediaWiki (PHP)

predicted response time

of MediaWiki (PHP)

low

er

is fa

ste

r

higher CPU utilization lower CPU utilization

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

0.0

0.5

1.0

1.5

2.0

2.5

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2

norm

aliz

ed a

vera

ge r

esponse ti

me

normalized CPU utilization (1.0 means fully utilized without SMT)

response time (SMT1) response time (SMT2)

Response time predicted by our model on 1-core of Xeon

18

200

250

300

350

400

450

500

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

SMT2 was better

low

er is

fa

ste

r

higher CPU utilizationlower CPU utilization

measured response time

of MediaWiki (PHP)

predicted response time

of MediaWiki (PHP)

low

er

is fa

ste

r

higher CPU utilization lower CPU utilization

SMT2 was best

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

200

250

300

350

400

450

500

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

with Adaptive SMT Control

Response time with adaptive SMT control on 16 cores of Xeon

19

SMT1 is automatically selected

with low CPU utilization

to improve response time

low

er

is fa

ste

r

SMT2 is automatically selected

with high CPU utilization

to achieve higher peak throughput

about 10% reduction in

response time compared to

the default (SMT2)

higher CPU utilization lower CPU utilization

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

250

300

350

400

450

500

550

600

650

700

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

SMT4 (4-way SMT)

with Adaptive Control

Response time with adaptive SMT control on 16 cores of POWER7

20

SMT1 is

selected

SMT2 is

selected

SMT4 is

selected

higher CPU utilization lower CPU utilization about 10% reduction in

response time compared to

the default (SMT4)

low

er

is fa

ste

r

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

200

250

300

350

400

450

500

0 2 4 6 8 10 12 14 16 18 20 22

resp

on

se

tim

e [m

se

c]

number of emulated clients per core

SMT1 (disabled SMT)

SMT2 (2-way SMT)

with Adaptive Control

Response time with adaptive SMT control on 1 core of Xeon

21

SMT2 is automatically selected

regardless of the CPU utilization

low

er

is fa

ste

r

higher CPU utilization lower CPU utilization

IBM Research - Tokyo

Adaptive SMT Control for More Responsive Web Applications © 2014 IBM Corporation

Summary

We showed that SMT may degrade the response time

on multicore processors with low CPU utilization

We developed a new queuing model to predict the

response time on multicore SMT processors

Our adaptive SMT control based on the new model

automatically selected the best SMT level at runtime

22

See the paper for more detail

evaluation with Ruby and Java workloads

results on moderate number of cores

detail of the queuing model