
Detecting Transient Bottlenecks in n-Tier Applications through Fine-Grained Analysis

Qingyang Wang, Yasuhiko Kanemasa, Jack Li, Deepal Jayasinghe, Toshihiro Shimizu, Masazumi Matsubara, Motoyuki Kawaba, Calton Pu

Low response time is essential for Quality of Service (e.g., SLAs for web-facing e-commerce applications). Amazon reports that every 100ms increase in page load time decreases sales by 1%.

Achieving low response time at high resource utilization is challenging: servers in typical data centers are busy only 18% of the time on average, wasting power.

Dilemma between Good Performance and High Utilization [Dec 2010]

2

An Important Factor: Transient Bottlenecks

Transient bottlenecks (e.g., lasting tens of milliseconds) can cause wide-ranging response time variations for web-facing n-tier applications. [ICAC’12]

Transient bottlenecks (e.g., < 100ms) are challenging to detect with typical monitoring tools, which have coarse granularity.

[Figure: resource utilization [%] over time [s]; average: 75%.]

3

Experimental Setup

RUBBoS benchmark: a bulletin board system like Slashdot
24 web interactions
CPU intensive
Workload consists of emulated clients

Hardware: Intel Xeon E5607, two quad-core 2.26 GHz CPUs, 16 GB memory

4

Motivational Example

Response time & throughput of a 3-minute benchmark on the 4-tier application with increasing workloads.

[Figure: average response time and throughput vs. workload; the point at WL 8,000 is annotated at roughly 200ms.]

Average response time is low at workload 8,000; what about the response time distribution?

5

Measured Long-Tail Response Time Distribution

Response time distribution at workload 8,000: more than 5% of requests take over 2 seconds.

[Figure: # of completed requests (log scale) vs. response time.]
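As a simple illustration of how this tail fraction can be computed from per-request response times, here is a minimal sketch; the input is assumed to be a plain list of response times in seconds.

    # Minimal sketch: fraction of requests whose response time exceeds a threshold.
    def tail_fraction(response_times_s, threshold_s=2.0):
        over = sum(1 for t in response_times_s if t > threshold_s)
        return over / len(response_times_s)

    # e.g., tail_fraction(rt_list) > 0.05 corresponds to the slide's observation
    # that more than 5% of requests take over 2 seconds.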

6

No Resources Are Saturated

Workload is CPU intensive; disk I/O utilization (<5%), network I/O utilization (<20%), and memory usage (<40%) are all low.

[Figure: Tomcat, CJDBC, and MySQL CPU utilization [%] over time [s]; annotated averages of 79.2%, 78.1%, 34.6%, and 26.7% show that no CPU is saturated.]

7

Transient Bottlenecks: Sources and Detection

Sources: We find that, in addition to bursty workloads, system environmental conditions cause transient bottlenecks:
Dynamic Voltage and Frequency Scaling (DVFS)
JVM garbage collection

Detection and visualization: We develop a fine-grained monitoring method based on passive network tracing in network switches, with negligible monitoring overhead for running applications.

8

Outline

Introduction & Motivation
Fine-grained load/throughput analysis method
1. Data collection via a passive network tracing tool
2. Calculation of active-load and throughput
3. Correlation analysis
Two Case Studies:
Dynamic Voltage and Frequency Scaling (DVFS)
JVM garbage collection
Conclusion & Future Work

9

Step 1: Data Collection through Passive Network Tracing

Collect interaction messages in the system using SysViz to measure fine-grained active load and throughput on each server.
Active load: the # of concurrent requests in a server
Throughput: the # of requests completed by a server

[Diagram: Web server, AP server, and DB server connected through a network switch; the SysViz box passively captures the timestamped messages (e.g., 01:58:20.039 through 01:58:20.193) exchanged among the tiers.]
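To make the output of this step concrete, the following minimal sketch loads per-request intervals from a trace file; the CSV layout and field names are assumptions chosen for illustration, not the actual SysViz output format.

    import csv
    from collections import namedtuple

    # A request interval observed at one server; the assumed CSV columns are
    # server, request_id, arrival_ts, departure_ts (timestamps in seconds).
    Request = namedtuple("Request", ["server", "request_id", "arrival", "departure"])

    def load_trace(path):
        # Read one (arrival, departure) interval per request seen at each server.
        requests = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                requests.append(Request(
                    server=row["server"],
                    request_id=row["request_id"],
                    arrival=float(row["arrival_ts"]),
                    departure=float(row["departure_ts"]),
                ))
        return requests

    # Active load at time t: # of requests with arrival <= t < departure.
    # Throughput of a window: # of requests whose departure falls in that window.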

10

Step 2: Fine-Grained Active Load Calculation in a Server

[Diagram: SysViz monitoring for MySQL; each request's arrival and departure timestamps are placed on a timeline divided into 50ms windows, and the active load is computed per window.]
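Below is a minimal sketch of the per-window active-load computation, assuming the Request records from the previous sketch and 50ms windows; the time-weighted averaging rule is an illustrative choice, not necessarily the exact rule used in the paper.

    # Time-averaged active load per 50 ms window: for each window, sum the time
    # each request's [arrival, departure) interval overlaps the window, then
    # divide by the window length.
    WINDOW = 0.050  # 50 ms, as in the slides

    def active_load(requests, t_start, t_end, window=WINDOW):
        n_windows = int((t_end - t_start) / window)
        busy = [0.0] * n_windows
        for r in requests:
            first = max(0, int((r.arrival - t_start) / window))
            last = min(n_windows - 1, int((r.departure - t_start) / window))
            for w in range(first, last + 1):
                w_lo = t_start + w * window
                w_hi = w_lo + window
                busy[w] += max(0.0, min(r.departure, w_hi) - max(r.arrival, w_lo))
        return [b / window for b in busy]  # avg. # of concurrent requests per window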

11

Step 3: Active-Load/Throughput Correlation Analysis

[Figure: server throughput vs. active load in a server; throughput grows with load in the non-saturation area, reaches TPmax at the saturation point N*, and levels off in the saturation area.]
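The sketch below illustrates this correlation step, again assuming the per-request records from Step 1; the rule used here to pick N* (the smallest load whose average throughput is within a small tolerance of TPmax) is an illustrative heuristic, not the paper's exact procedure.

    from collections import defaultdict

    def throughput(requests, t_start, t_end, window=0.050):
        # Requests completed in each 50 ms window, scaled to req/s.
        n_windows = int((t_end - t_start) / window)
        done = [0] * n_windows
        for r in requests:
            if r.departure < t_start:
                continue
            w = int((r.departure - t_start) / window)
            if w < n_windows:
                done[w] += 1
        return [c / window for c in done]

    def saturation_point(loads, throughputs, tolerance=0.05):
        # Group windows by rounded active load, average throughput per load level,
        # and return the smallest load whose average throughput is within
        # `tolerance` of the maximum average throughput (TPmax).
        by_load = defaultdict(list)
        for n, tp in zip(loads, throughputs):
            by_load[round(n)].append(tp)
        avg_tp = {n: sum(v) / len(v) for n, v in by_load.items()}
        tp_max = max(avg_tp.values())
        for n in sorted(avg_tp):
            if avg_tp[n] >= (1.0 - tolerance) * tp_max:
                return n, tp_max  # (N*, TPmax)
        return None, tp_max

Windows whose active load reaches or exceeds the estimated N* can then be flagged as candidate transient-bottleneck windows.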

12

Step 3: Active-Load/Throughput Analysis for MySQL at WL 12,000

Active load & throughput at every 50ms time window.

[Figure: fine-grained active load in MySQL (active load [req], 0-60) and fine-grained throughput in MySQL (throughput [req/s], 0-8000) over a 2-second time window.]

13

Step 3: Active-Load/Throughput Analysis for MySQL at WL 12,000 (Cont.)

[Figure: fine-grained MySQL active load (active load [req], 0-60) and MySQL throughput (throughput [req/s], 0-8000) over the same 2-second timeline.]

14

Three Typical Cases of Active-Load/Throughput Analysis

Non-saturation case: workload 4,000
Partial saturation case: workload 12,000
Full saturation case: workload 16,000

[Figure: MySQL throughput vs. MySQL active load for the three workloads, illustrating the non-saturation, partial saturation, and full saturation cases.]
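As an illustration, the following sketch classifies a run into the three cases from its per-window active-load series and an estimated N*; the 5%/95% thresholds are assumptions chosen for the example, not values from the paper.

    # Minimal sketch: classify a run by the fraction of 50 ms windows whose
    # active load is at or above the saturation point N*.
    def classify_run(loads, n_star, low=0.05, high=0.95):
        saturated = sum(1 for n in loads if n >= n_star)
        frac = saturated / len(loads)
        if frac < low:
            return "non-saturation"      # e.g., workload 4,000
        if frac > high:
            return "full saturation"     # e.g., workload 16,000
        return "partial saturation"      # transient bottlenecks, e.g., workload 12,000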

15

Outline

Introduction & Motivation
Fine-grained load/throughput analysis method
1. Data collection via a passive network tracing tool
2. Calculation of active-load and throughput
3. Correlation analysis
Two Case Studies:
Dynamic Voltage and Frequency Scaling (DVFS)
JVM garbage collection
Conclusion & Future Work

17

Transient Bottlenecks Caused by DVFS

DVFS is designed to adjust the CPU frequency to meet instantaneous performance needs while minimizing power consumption.

We found that the anti-synchrony between the DVFS adjustment period and the workload burst cycles causes frequent transient bottlenecks.

Dell's BIOS-level DVFS control:

P-state               P0    P1    P4    P5    P8
CPU frequency [MHz]   2261  2128  1729  1596  1197
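For reference, on a Linux host that exposes the standard cpufreq sysfs interface, the current CPU frequency can be sampled alongside the 50ms analysis windows with a sketch like the one below; whether this interface reflects Dell's BIOS-level DVFS control described above is an assumption left open.

    # Minimal sketch: sample the current frequency of CPU0 every 50 ms
    # via the standard Linux cpufreq sysfs interface (values are in kHz).
    import time

    FREQ_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"

    def sample_frequency(duration_s=3.0, interval_s=0.050):
        samples = []
        end = time.time() + duration_s
        while time.time() < end:
            with open(FREQ_PATH) as f:
                samples.append((time.time(), int(f.read().strip())))
            time.sleep(interval_s)
        return samples  # correlate these timestamps with the 50 ms load/throughput windows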

18

Transient Bottlenecks of MySQL at Workload 8,000

[Figure: MySQL active load [#] (the # of concurrent requests in MySQL) and MySQL throughput [req/s] for the DVFS On case and the DVFS Off case; annotations mark periods when the CPU is in low frequency and when it is in high frequency.]

19

Transient Bottlenecks of MySQL at Workload 10,000

[Figure: MySQL active load [#] and MySQL throughput [req/s] for the DVFS On case and the DVFS Off case; annotations mark low-, median-, and high-frequency CPU periods.]

20

System Response Time with DVFS On/Off at Workload 10,000

[Figure: response time [s] over time [s] for the DVFS On case and the DVFS Off case.]

21

Outline

Introduction & Motivation
Fine-grained load/throughput analysis method
1. Data collection via a passive network tracing tool
2. Calculation of active-load and throughput
3. Correlation analysis
Two Case Studies:
Dynamic Voltage and Frequency Scaling (DVFS)
JVM garbage collection
Conclusion & Future Work

22

Conclusion & Future Work

Transient bottlenecks can cause long-tail response time distributions in an n-tier application.
We developed a fine-grained active-load/throughput analysis method that can detect and visualize transient bottlenecks.
Ongoing work: analyzing more types of workloads and more system factors that cause transient bottlenecks.

23

Thank You. Any Questions?

Qingyang Wang

[email protected]

24