IBM Research - Tokyo
Oct 27, 2014 IISWC @ Raleigh, NC, USA © 2014 IBM Corporation
Adaptive SMT Control for More Responsive Web Applications
Hiroshi Inoue†‡ and Toshio Nakatani† † IBM Research – Tokyo ‡ University of Tokyo
Response time matters!
Peak throughput has been the common metric for Web server performance
Even sub-second improvements in response times are
essential for better user experiences†
– Amazon: +100 msec → 1% drop in sales
– Yahoo: +400 msec → 5-9% drop in traffic
– Google: +500 msec → 20% drop in searches
We focus on improving the response time of Web
application servers
† Nicole Sullivan. Design Fast Websites. Oct 14, 2008
Key Question: How does SMT affect response time?
SMT (Simultaneous Multithreading, a.k.a. Hyper-Threading) allows multiple hardware threads to run on one core
SMT typically
– improves aggregate throughput
– degrades single-thread performance
Question: How does SMT affect the response time of Web application servers?
Outline
1. How SMT affects response time
2. Adaptive SMT control with queuing model
Evaluations
Processors:
– Xeon (SandyBridge-EP): 2-way SMT, 2.9 GHz, 16 cores
– POWER7: 4-way SMT, 3.55 GHz, 16 cores
Workloads:
– PHP (MediaWiki)
– Ruby (Ruby on Rails)
– Java (Cognos BI)
OS: Red Hat Enterprise Linux 6.4 (kernel 2.6.32)
Response time of the PHP application on 16 cores of Xeon
(Figure: x-axis = number of emulated clients per core, from lower to higher CPU utilization; y-axis = response time [msec], lower is faster; series = SMT1 (disabled SMT), SMT2 (2-way SMT).)
– SMT degraded response time when CPUs were not fully loaded (SMT1 was better)
– SMT improved response time at high CPU utilization (SMT2 was better)
Response time of the PHP application on 16 cores of POWER7
(Figure: x-axis = number of emulated clients per core, from lower to higher CPU utilization; y-axis = response time [msec], lower is faster; series = SMT1 (disabled SMT), SMT2 (2-way SMT), SMT4 (4-way SMT).)
– SMT degraded response time when CPUs were not fully loaded
– SMT1 was best at lower CPU utilization, SMT2 at intermediate utilization, and SMT4 at higher CPU utilization
Response time of the PHP application on 1 core of Xeon
(Figure: x-axis = number of emulated clients per core, from lower to higher CPU utilization; y-axis = response time [msec], lower is faster; series = SMT1 (disabled SMT), SMT2 (2-way SMT).)
– SMT improved response time even when CPUs were not fully loaded (SMT2 was better)
How does SMT affect response time?
SMT hurts the response time on multicore systems at low CPU utilization, which is the common case in today's servers
The crossover point depends on the number of cores
Effect of SMT on response time:
                    Low CPU utilization    High CPU utilization
on 1 core           improve                improve
on multiple cores   degrade                improve
Histogram of response time at low (~25%) CPU utilization
(Figure, two panels: on 1 core of Xeon and on 16 cores of Xeon. Axes = response time [msec] (lower means faster response) and histogram value; series = SMT1 (disabled SMT), SMT2 (2-way SMT).)
– SMT degraded single-thread performance and shifted the peak of the histogram towards slower response times
– On 1 core: many long-latency transactions; SMT reduced the long-latency transactions
– On 16 cores: almost no long-latency transactions
Breaking down response time
response time Tr = service time Ts + waiting time Tw
SMT typically
– increases service time (CPU time) by lowering single-thread performance
– reduces waiting time (in the task scheduling queue) by providing more hardware threads
SMT degrades the response time on multicore systems at low CPU utilization because waiting time is not significant in such cases
In the other cases (single core or high utilization), waiting time significantly affects the total response time
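As a worked illustration of this breakdown, the textbook M/M/s formulas show how quickly the waiting time vanishes as servers are added. This is a sketch under stated assumptions, not the paper's model: the symbols λ (arrival rate), s (number of servers), ρ (per-server utilization), a (offered load) and C (Erlang C probability that an arrival has to wait) are introduced here for illustration only, and every server is assumed to be identical and independent.

```latex
\begin{align*}
T_r &= T_s + T_w, \qquad a = \lambda T_s, \qquad \rho = \frac{a}{s} \\
C(s, a) &= \frac{\dfrac{a^s}{s!}\,\dfrac{1}{1-\rho}}
               {\displaystyle\sum_{k=0}^{s-1}\frac{a^k}{k!} + \frac{a^s}{s!}\,\frac{1}{1-\rho}} \\
T_w &= \frac{C(s, a)}{s\,(1-\rho)}\,T_s \\
s = 1,\ \rho = 0.25 &: \quad C = \rho = 0.25, \quad
  T_w = \frac{0.25}{0.75}\,T_s \approx 0.33\,T_s \\
s = 16,\ \rho = 0.25 &: \quad C \approx 5 \times 10^{-6}, \quad
  T_w \approx 4 \times 10^{-7}\,T_s
\end{align*}
```

At the same 25% utilization, queuing adds about a third of the service time with one server but is negligible with 16, so in the multicore case an SMT-induced increase in Ts has almost no waiting time to win back.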
Outline
1. How SMT affects response time
2. Adaptive SMT control with queuing model
Adaptive SMT Control
We periodically (once every 5 sec)
– obtain the CPU utilization from /proc/stat,
– calculate the response time for each SMT level using a new queuing model, and
– select the best SMT level
Implemented as a user-space daemon, without modifying the OS kernel
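The first step of this loop, sampling CPU utilization from /proc/stat, could be sketched roughly as follows. This is a minimal illustration, not the authors' daemon: the function names are hypothetical, and counting iowait as idle time is an assumption made here. It takes two snapshots of the aggregate cpu line and returns the busy fraction over the interval between them.

```python
import time


def parse_cpu_line(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Columns: user nice system idle iowait irq softirq steal.
            # Later columns (guest, guest_nice) are already counted in user/nice.
            values = [int(v) for v in fields[1:9]]
            # Assumption: iowait is treated as idle time.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before, after):
    """Busy fraction (0.0 to 1.0) between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def sample_utilization(interval_sec=5.0, path="/proc/stat"):
    """Read /proc/stat twice, interval_sec apart, and return the utilization."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_sec)
    with open(path) as f:
        after = f.read()
    return cpu_utilization(before, after)
```

A daemon along these lines would call sample_utilization() once per period and pass the result to its response-time model.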
Challenges in queuing model for SMT processors
How to model single-thread performance on an SMT processor
– affected by resource contention among the SMT threads
How to model task migration behavior of the OS task
scheduler
– aggressively balances the load among the SMT threads
within one core while minimizing migrations among different
cores
Hierarchical queuing model
1. In-core modeling: model a single SMT core
– To calculate the service time (i.e. single-thread performance) and the waiting time without considering task migration
2. Out-of-core modeling: model task migration among cores
– To adjust the waiting time to account for task migration
Both phases are based on the standard M/M/s model
The model takes CPU utilization as its input and needs no task characteristics
See the paper for the model details
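To make the "predict each SMT level, then pick the best" step concrete, here is a deliberately simplified sketch. It is not the paper's hierarchical model: it pools all hardware threads into one textbook M/M/s queue and applies a fixed, hypothetical per-thread slowdown for each SMT level (the values in HYPOTHETICAL_SLOWDOWN are made up for illustration, and all function names are hypothetical). Because it ignores that a thread running alone on a core suffers no contention, it should not be expected to reproduce the single-core results. Utilization is normalized so that 1.0 means fully utilized without SMT.

```python
import math


def erlang_c(servers, offered_load):
    """Probability that an arrival must queue in an M/M/s system."""
    blocking = 1.0  # Erlang B with zero servers
    for k in range(1, servers + 1):
        # Standard Erlang B recurrence, numerically stable for large s.
        blocking = offered_load * blocking / (k + offered_load * blocking)
    rho = offered_load / servers
    return blocking / (1.0 - rho * (1.0 - blocking))


def predicted_response_time(cores, smt_level, utilization, slowdown):
    """Normalized mean response time Tr = Ts + Tw (1.0 = service time at SMT1)."""
    servers = cores * smt_level          # every hardware thread is one server
    service_time = slowdown              # assumed fixed per-thread slowdown
    offered_load = utilization * cores * service_time
    rho = offered_load / servers
    if rho >= 1.0:
        return math.inf                  # saturated: the queue grows without bound
    waiting_time = (erlang_c(servers, offered_load) * service_time
                    / (servers * (1.0 - rho)))
    return service_time + waiting_time


def choose_smt_level(cores, utilization, slowdowns):
    """Pick the SMT level with the lowest predicted response time.

    Ties (for example, every level saturated) go to the higher SMT level.
    """
    return min(
        slowdowns,
        key=lambda level: (
            predicted_response_time(cores, level, utilization, slowdowns[level]),
            -level,
        ),
    )


# Hypothetical slowdown factors, for illustration only.
HYPOTHETICAL_SLOWDOWN = {1: 1.0, 2: 1.6}
```

With these made-up factors, choose_smt_level(16, 0.25, HYPOTHETICAL_SLOWDOWN) selects SMT1 because waiting time is negligible at low utilization, while a normalized utilization of 1.1 selects SMT2 because SMT1 is saturated.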
Response time predicted by our model on 16 cores of Xeon
(Figure, two panels. Predicted response time of MediaWiki (PHP): x-axis = normalized CPU utilization (1.0 means fully utilized without SMT); y-axis = normalized average response time; series = response time (SMT1), response time (SMT2). Measured response time of MediaWiki (PHP): x-axis = number of emulated clients per core; y-axis = response time [msec]; series = SMT1 (disabled SMT), SMT2 (2-way SMT). Lower is faster.)
– In both panels, SMT1 was better at lower CPU utilization and SMT2 was better at higher CPU utilization
Response time predicted by our model on 16 cores of POWER7
(Figure, two panels. Predicted response time of MediaWiki (PHP): x-axis = normalized CPU utilization (1.0 means fully utilized without SMT); y-axis = normalized average response time; series = response time (SMT1), response time (SMT2), response time (SMT4). Measured response time of MediaWiki (PHP): x-axis = number of emulated clients per core; y-axis = response time [msec]; series = SMT1 (disabled SMT), SMT2 (2-way SMT), SMT4 (4-way SMT). Lower is faster.)
– In both panels, SMT1 was best at lower CPU utilization, SMT2 at intermediate utilization, and SMT4 at higher CPU utilization
Response time predicted by our model on 1 core of Xeon
(Figure, two panels. Predicted response time of MediaWiki (PHP): x-axis = normalized CPU utilization (1.0 means fully utilized without SMT); y-axis = normalized average response time; series = response time (SMT1), response time (SMT2). Measured response time of MediaWiki (PHP): x-axis = number of emulated clients per core; y-axis = response time [msec]; series = SMT1 (disabled SMT), SMT2 (2-way SMT). Lower is faster.)
– In both panels, SMT2 was better across the range of CPU utilization
Response time with adaptive SMT control on 16 cores of Xeon
(Figure: x-axis = number of emulated clients per core, from lower to higher CPU utilization; y-axis = response time [msec], lower is faster; series = SMT1 (disabled SMT), SMT2 (2-way SMT), with Adaptive SMT Control.)
– SMT1 is automatically selected with low CPU utilization to improve response time
– SMT2 is automatically selected with high CPU utilization to achieve higher peak throughput
– about 10% reduction in response time compared to the default (SMT2)
Response time with adaptive SMT control on 16 cores of POWER7
(Figure: x-axis = number of emulated clients per core, from lower to higher CPU utilization; y-axis = response time [msec], lower is faster; series = SMT1 (disabled SMT), SMT2 (2-way SMT), SMT4 (4-way SMT), with Adaptive Control.)
– SMT1, SMT2, and SMT4 are selected in turn as CPU utilization increases
– about 10% reduction in response time compared to the default (SMT4)
Response time with adaptive SMT control on 1 core of Xeon
(Figure: x-axis = number of emulated clients per core, from lower to higher CPU utilization; y-axis = response time [msec], lower is faster; series = SMT1 (disabled SMT), SMT2 (2-way SMT), with Adaptive Control.)
– SMT2 is automatically selected regardless of the CPU utilization
Summary
We showed that SMT may degrade the response time
on multicore processors with low CPU utilization
We developed a new queuing model to predict the
response time on multicore SMT processors
Our adaptive SMT control based on the new model
automatically selected the best SMT level at runtime
See the paper for more details:
– evaluation with Ruby and Java workloads
– results on a moderate number of cores
– details of the queuing model