©nec laboratories america 1 huadong liu (u. of tennessee) hui zhang, rauf izmailov, guofei jiang,...

21
1 ©NEC Laboratories America Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang Real-time Application Monitoring and Diagnosis for Service Hosting Platforms of Black Boxes

Upload: ralf-hampton

Post on 31-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

1©NEC Laboratories America

Huadong Liu (U. of Tennessee)

Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America)

Presented by: Hui Zhang

Real-time Application Monitoring and Diagnosis for Service Hosting

Platforms of Black Boxes

Page 2: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

2©NEC Laboratories America

outline

MotivationSRAMD architectureApplication component dependency

discoveryEvaluationConclusions

Page 3: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

3©NEC Laboratories America

Motivation

Service hosting systems Web farms, service-oriented utility computing networks, Peer-to-Peer

service composition based computing grids, … Service management

Fault diagnosis, capacity planning, performance analysis, impact analysis, etc.

Challenges Application components are usually delivered as black-boxes w/o

sufficient instrumentation The huge amount of logging information in large-scale systems makes

real-time monitoring and debugging unrealistic with a centralized approach

App. 1App. 3

App. 2App. 4

Page 4: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

4©NEC Laboratories America

An intuition of the SRAMD Art

Source: www.pictureMOSAICs.com

Page 5: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

5©NEC Laboratories America

Scalable Real-time Application Monitoring and Diagnosis

SRAMD: an extensible tool that is easy to deploy scalable, and able to effectively profile the intricate dependency

relationships among interacting application components seen as black boxes.

Our approach uses low level packet traces instead of high level event

traces to get insight into application components Has end-system instrumentation for close observation on the

correlation between application performance and local resource utilization, and for enabling a rich set of queries for diagnosis

understands the overall system/application behavior and performance by aggregating and correlating summarizations from distributed components

Page 6: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

6©NEC Laboratories America

SRAMD in Operation

application Xapplication Yapplication Zhosting server

An application levelpassive resource monitor with active

summarization

An extensible framework forapplication topology discovery,

capacity planning andperformance debugging

Page 7: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

7©NEC Laboratories America

The SRAMD Controller

Aggregator retrieves, validates information

blocks available in the repository, and organizes them into per-application groups.

Collector

Aggregator

Visualizer

Diagnosis

Diagnosis generates probing requests to

related monitors with operator interaction to get detailed information about application components and to isolate possible bottlenecks for performance debugging.

Visualizer constructs in-memory DOT files

[DOT] using outputs from the aggregator and calls the Grappa [Grappa] to visualize application topologies enriched with component traffic statistics and causal probabilities.

collector passively collects summarization

data from distributed monitors through UDP.

Page 8: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

8©NEC Laboratories America

The SRAMD Controller snapshot

Page 9: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

9©NEC Laboratories America

The SRAMD Monitor

Periodically probe for CPU, memory and disk usage of every registered application component.

Passively capture network traffic and associate captured packets to registered application components,

Actively calculate useful local application statistics and dependencies from packet traces

Temporarily perform diagnosis tasks on-demand to assist performance diagnosis and debugging.

Page 10: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

10©NEC Laboratories America

Application component dependency discovery

Given two application components A and B in the system, we want to discover the following real-time dependency relationships between A and B during a time interval: are the input requests of one components caused by another

one (directly or indirectly)? and in what percentage if yes?

A B C D E

tim

e lin

e

C

B

D E

A

C

B

D E

A

C

B

D E

r1

r2 r3

A

Page 11: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

11©NEC Laboratories America

Dealing with transient connections

Local Dependency Discovery (LDD) Find IDs of peer application components that local ones

talked to in the last report interval. Every SRAMD monitor sends a list of (LocalPort, AppCompID) to the monitor at every hosting server that the communicating application components are running on.

Count the number of requests (including nesting requests) between application components and calculate the probability of their causal dependency.

Although requests appear to be nested by accident, if the same nesting relationship appears with a high probability, it is highly possible that the nesting represents a causal dependency of application components.

Page 12: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

12©NEC Laboratories America

Dealing with persistent connections and connectionlesscommunications

Traffic Regulation based Component Dependency Discovery (TRCDD) Divert socket based traffic regulation. Under

investigation.

A->B

B->C

Page 13: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

13©NEC Laboratories America

Evaluation: SRAMD overhead (1)

Experiment setup

Receiverthrulayd

Senderthrulay

SRAMController

UDP Packetsover giga ethernet

Intel 2.8GHz SMP

SRAM Monitor

Page 14: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

14©NEC Laboratories America

Evaluation: SRAMD overhead (2)

CPU overhead of the SRAMD monitor with bulk UDP traffic using different packet sending rates and packet sizes

Page 15: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

15©NEC Laboratories America

CPU overhead of packet-application matching and sniffing data rate 100Mb/s and packet size 1500 Bytes.

Evaluation: SRAMD overhead (3)

Association Probability

1/250k 0.01 0.02 0.03 0.04

CPU Overhead % 4.22 5.53 6.15 7.12 7.86

Page 16: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

16©NEC Laboratories America

Evaluation: LDD algorithm (1)

Experiment setup

Clients (httperf)

Web Server (tinyproxy)

Web Server (tinyproxy)

Database Server (derby)

Database Server (derby)

ApplicationServer I1

ApplicationServer I2

W1 W2

A1 A2

D1 D2

C

ApplicationServer I1

ApplicationServer I2

Logic view physical view

Page 17: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

17©NEC Laboratories America

Evaluation: LDD algorithm (2)

Causal probability as observed on application server A1 with different number of concurrent clients

Page 18: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

18©NEC Laboratories America

Conclusions and Future Work

An unobtrusive application-level monitoring and diagnosis tool that does not make any assumptions about the traced applications.

Two schemes to infer dependency relationships of application components in different scenarios.

An initial assessment of the quality and overhead of application-level packet tracing and an evaluation of the statistical dependency discovery scheme.

Possible extensions A kernel module to obtain per-application disk read / write

statistics Application of data mining techniques to packet traces

Page 19: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

19©NEC Laboratories America

Thanks!

Questions?

Page 20: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

20©NEC Laboratories America

Backup slides

Page 21: ©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang

21©NEC Laboratories America

Calculate Response Time from Traces

t1 t3

t2

t4

a

b

ct5

WS

AS

DS

WS

AS

DS

WS

AS

DS