
IBM Research - Tokyo

Oct 27, 2014 IISWC @ Raleigh, NC, USA © 2014 IBM Corporation

Adaptive SMT Control for More Responsive Web Applications

Hiroshi Inoue†‡ and Toshio Nakatani† † IBM Research – Tokyo ‡ University of Tokyo



Response time matters!

Peak throughput has been the common metric for Web server performance

Even sub-second improvements in response time are essential for a better user experience†

– Amazon: +100 msec → 1% drop in sales

– Yahoo: +400 msec → 5-9% drop in traffic

– Google: +500 msec → 20% drop in searches

We focus on improving the response time of Web application servers

† Nicole Sullivan. Design Fast Websites. Oct 14, 2008



Key Question: How does SMT affect response time?

SMT (Simultaneous Multithreading, a.k.a. Hyper-Threading) allows multiple hardware threads to run on one core

SMT typically

– improves aggregate throughput

– degrades single-thread performance

Question: How does SMT affect the response time of a Web application server?



Outline

1. How SMT affects response time

2. Adaptive SMT control with queuing model




Evaluations

Processors:

– Xeon (SandyBridge-EP): 2-way SMT, 2.9 GHz, 16 cores

– POWER7: 4-way SMT, 3.55 GHz, 16 cores

Workloads:

– PHP (MediaWiki)

– Ruby (Ruby on Rails)

– Java (Cognos BI)

OS: Red Hat Enterprise Linux 6.4 (kernel 2.6.32)



Response time of the PHP application on 16 cores of Xeon

[Figure: x-axis: number of emulated clients per core (0 to 22), annotated from lower to higher CPU utilization; y-axis: response time [msec] (200 to 500; lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT).]

SMT degraded response time when CPUs were not fully loaded (SMT1 was better)

SMT improved response time at high CPU utilization (SMT2 was better)



Response time of the PHP application on 16 cores of POWER7

[Figure: x-axis ticks 0 to 22; y-axis: response time [msec] (250 to 700; lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT), SMT4 (4-way SMT). Regions marked on the plot: SMT1 was best, SMT2 was best, SMT4 was best.]

SMT degraded response time when CPUs were not fully loaded




Response time of the PHP application on 1 core of Xeon

[Figure: x-axis ticks 0 to 22; y-axis: response time [msec] (200 to 500; lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT).]

SMT improved response time even when CPUs were not fully loaded (SMT2 was better)




How does SMT affect response time?

SMT hurts response time on multicore systems at low CPU utilization, which is the common case in today’s servers

The crossover point depends on the number of cores

                     Low CPU utilization   High CPU utilization
on 1 core            improve               improve
on multiple cores    degrade               improve



Histogram of response time at low (~25%) CPU utilization

[Figure: two histograms of response time, one on 1 core of Xeon and one on 16 cores of Xeon. Response time axis: 200 to 480 msec (lower means faster response); histogram axis: 0 to 0.14. Series: SMT1 (disabled SMT), SMT2 (2-way SMT).]

SMT degraded single-thread performance and shifted the peak of the histogram towards slower response times

On 1 core: many long-latency transactions; SMT reduced the long-latency transactions

On 16 cores: almost no long-latency transactions



Breaking down response time

response time Tr = service time Ts + waiting time Tw

SMT typically

– increases service time (CPU time) by lowering single-thread performance

– reduces waiting time (in the task scheduling queue) by providing more hardware threads

SMT degrades response time on multicore systems at low CPU utilization because waiting time is not significant in that case

In the other cases (single core or high utilization), waiting time affects the total response time
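
The breakdown Tr = Ts + Tw can be illustrated with the standard M/M/s queue. The sketch below is a minimal example, not the model from the paper: it uses the textbook Erlang C formula, treats each core as one server, and assumes a hypothetical service time of 200 msec. The function names erlang_c and mean_wait are introduced here only for illustration.

```python
from math import factorial


def erlang_c(servers, utilization):
    """Probability that an arriving request has to wait in an M/M/s queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    offered = servers * utilization  # offered load a = arrival rate * service time
    tail = offered ** servers / factorial(servers) / (1.0 - utilization)
    head = sum(offered ** k / factorial(k) for k in range(servers))
    return tail / (head + tail)


def mean_wait(servers, utilization, service_time):
    """Mean waiting time Tw in the queue, in the same unit as service_time."""
    p_wait = erlang_c(servers, utilization)
    return p_wait * service_time / (servers * (1.0 - utilization))


if __name__ == "__main__":
    ts = 200.0  # hypothetical service time Ts in msec (assumption)
    for cores in (1, 16):
        tw = mean_wait(cores, 0.25, ts)
        print(f"{cores:2d} core(s): Ts = {ts:.1f} msec, Tw = {tw:.4f} msec")
```

At 25% utilization the single-server queue still adds a waiting time of one third of Ts, while with 16 servers the waiting time is a tiny fraction of a millisecond, so any increase in Ts caused by SMT dominates the response time.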




Outline

1. How SMT affects response time

2. Adaptive SMT control with queuing model




Adaptive SMT Control

We periodically (once every 5 sec)

– obtain the CPU utilization from /proc/stat,

– calculate the response time for each SMT level using a new queuing model, and

– select the best SMT level

Implemented as a user-space daemon without modifying the OS kernel
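
The first step of the loop, obtaining CPU utilization from /proc/stat, could look roughly like the sketch below. It is an illustration under stated assumptions rather than the daemon's actual code: it reads the aggregate "cpu" line, counts idle and iowait time as not busy (an assumption; the slides do not say how iowait is treated), and uses only the first eight counters (user through steal). All function names are hypothetical.

```python
import time


def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            # Counters: user, nice, system, idle, iowait, irq, softirq, steal
            fields = [int(v) for v in parts[1:9]]
            idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
            total = sum(fields)
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def utilization(before, after):
    """CPU utilization in [0, 1] between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_times(before)
    busy1, total1 = parse_cpu_times(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def read_proc_stat(path="/proc/stat"):
    with open(path) as f:
        return f.read()


def sample_utilization(interval_sec=5.0):
    """Take two snapshots interval_sec apart (5 sec matches the period above)."""
    before = read_proc_stat()
    time.sleep(interval_sec)
    return utilization(before, read_proc_stat())
```

On a Linux host, calling sample_utilization() blocks for one interval and returns the system-wide utilization over that window, which is the only input the model needs.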




Challenges in queuing model for SMT processors

How to model single-thread performance on an SMT processor

– affected by resource contention among the SMT threads

How to model the task migration behavior of the OS task scheduler

– aggressively balances the load among the SMT threads within one core while minimizing migrations among different cores



Hierarchical queuing model

1. In-core modeling: model a single SMT core

– To calculate service time (i.e. single-thread performance) and waiting time without considering task migration

2. Out-of-core modeling: model the task migration among cores

– To adjust the waiting time to account for task migration

Both phases are based on the standard M/M/s model

The model takes CPU utilization as input, without task characteristics

See the paper for the model details
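
To give a feel for how an M/M/s-based prediction can drive the choice of SMT level, here is a deliberately simplified, flat sketch. It is not the hierarchical model from the paper: it treats every hardware thread as an independent server, ignores task migration, and applies a fixed, hypothetical slowdown factor per SMT level (the SLOWDOWN values are illustrative assumptions, not measurements). The input is CPU utilization normalized so that 1.0 means fully utilized without SMT.

```python
from math import factorial

# Hypothetical per-thread slowdown when SMT is enabled (assumption, illustrative only).
SLOWDOWN = {1: 1.0, 2: 1.6}


def mms_response_time(servers, arrival_rate, service_time):
    """Mean response time (service + waiting) of an M/M/s queue; inf if overloaded."""
    offered = arrival_rate * service_time
    rho = offered / servers
    if rho >= 1.0:
        return float("inf")
    tail = offered ** servers / factorial(servers) / (1.0 - rho)
    head = sum(offered ** k / factorial(k) for k in range(servers))
    p_wait = tail / (head + tail)  # Erlang C probability of waiting
    return service_time + p_wait * service_time / (servers * (1.0 - rho))


def best_smt_level(cores, utilization_smt1, base_service_time=1.0, slowdown=SLOWDOWN):
    """Return (best level, {level: predicted response time}) for a given load."""
    arrival_rate = utilization_smt1 * cores / base_service_time
    predicted = {
        level: mms_response_time(cores * level, arrival_rate, base_service_time * factor)
        for level, factor in slowdown.items()
    }
    return min(predicted, key=predicted.get), predicted


if __name__ == "__main__":
    for util in (0.25, 0.95):
        level, times = best_smt_level(16, util)
        print(f"utilization {util:.2f}: choose SMT{level}, predicted {times}")
```

With these assumed values the sketch selects SMT1 on 16 cores at low utilization and SMT2 near saturation. A fixed slowdown factor ignores that contention depends on how many sibling threads are actually busy, so it should not be expected to reproduce the single-core behavior; handling that is the purpose of the in-core modeling step.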




Response time predicted by our model on 16 cores of Xeon

[Figure, predicted response time of MediaWiki (PHP): x-axis: normalized CPU utilization (1.0 means fully utilized without SMT), 0.0 to 1.2; y-axis: normalized average response time, 0.0 to 2.5 (lower is faster). Series: response time (SMT1), response time (SMT2). Regions marked on the plot: SMT1 was better, SMT2 was better.]

[Figure, measured response time of MediaWiki (PHP): x-axis ticks 0 to 22, annotated from lower to higher CPU utilization; y-axis: response time [msec], 200 to 500 (lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT). Regions marked on the plot: SMT1 was better, SMT2 was better.]



Response time predicted by our model on 16 cores of POWER7

[Figure, predicted response time of MediaWiki (PHP): x-axis ticks 0.0 to 1.8; y-axis: normalized average response time, 0.0 to 2.5 (lower is faster). Series include response time (SMT1). Regions marked on the plot: SMT1 was best, SMT2 was best, SMT4 was best.]

[Figure, measured response time of MediaWiki (PHP): x-axis ticks 0 to 22; y-axis: response time [msec], 250 to 700 (lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT), SMT4 (4-way SMT). Regions marked on the plot: SMT1 was best, SMT2 was best, SMT4 was best.]




Response time predicted by our model on 1 core of Xeon

[Figure, predicted response time of MediaWiki (PHP): x-axis ticks 0.0 to 1.2; y-axis: normalized average response time, 0.0 to 2.5 (lower is faster). Series: response time (SMT1), response time (SMT2). SMT2 was best.]

[Figure, measured response time of MediaWiki (PHP): x-axis ticks 0 to 22; y-axis: response time [msec], 200 to 500 (lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT). SMT2 was better.]



Response time with adaptive SMT control on 16 cores of Xeon

[Figure: x-axis ticks 0 to 22; y-axis: response time [msec], 200 to 500 (lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT), with Adaptive SMT Control.]

SMT1 is automatically selected at low CPU utilization to improve response time

SMT2 is automatically selected at high CPU utilization to achieve higher peak throughput

About 10% reduction in response time compared to the default (SMT2)




Response time with adaptive SMT control on 16 cores of POWER7

[Figure: x-axis ticks 0 to 22, annotated from lower to higher CPU utilization; y-axis: response time [msec], 250 to 700 (lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT), SMT4 (4-way SMT), with Adaptive Control. Regions marked on the plot: SMT1 is selected, SMT2 is selected, SMT4 is selected.]

About 10% reduction in response time compared to the default (SMT4)



Response time with adaptive SMT control on 1 core of Xeon

[Figure: x-axis ticks 0 to 22; y-axis: response time [msec], 200 to 500 (lower is faster). Series: SMT1 (disabled SMT), SMT2 (2-way SMT), with Adaptive Control.]

SMT2 is selected regardless of the CPU utilization




Summary

We showed that SMT may degrade response time on multicore processors at low CPU utilization

We developed a new queuing model to predict the response time on multicore SMT processors

Our adaptive SMT control based on the new model automatically selected the best SMT level at runtime

See the paper for more details

– evaluation with Ruby and Java workloads

– results on a moderate number of cores

– details of the queuing model
