
Supporting System-Wide Similarity Queries for Networked System Management

Songyun Duan (1), Hui Zhang (2), Guofei Jiang (2), Xiaoqiao Meng (1)

1. IBM T.J. Watson Research Center, Hawthorne, NY
2. NEC Laboratories America, Princeton, NJ, USA

www.nec-labs.com


Outline

■ Problem statement

Solution

Evaluation


Problem statement

Motivation
– Large networked systems are the backbone of modern IT and Internet services.
– System dynamics and complexity lead to management difficulties when unexpected events happen.
– System administrators are overwhelmed with data in systems management.
  • Massive monitoring data from extensive instrumentation.


Problem statement

Goal
– Data analysis tools based on a simple and powerful query primitive for systems management
– Makes information intuitive to data users (e.g., sysadmins)
– Observation: similarity queries arise in various management tasks
  • Performance management
    » Given a performance problem in time period T, has the system ever experienced a similar problem in the past, when did it occur, and was a successful diagnosis reported for it?
  • Traffic management
    » What are the top-k <port, protocol> pairs that exhibit the most similar traffic patterns at an hourly time scale?
  • Network security
  • Workload management


Background

Preliminaries
– For managing massive logs and/or monitoring data, existing systems support data navigation/browsing with visualization or SQL-like queries

Related work
– AT&T: telecom data set visualization tools [keim1999]
– UC Berkeley TelegraphCQ: continuous dataflow processing for an uncertain world [Chandrasekaran2003]
– Yahoo Pig: web-scale log processing [olston2008]
– Amazon AWS: GrepTheWeb, Hadoop on AWS [varia2008]
– Facebook Hive: data warehousing using Hadoop [sarma2008]

Our work complements the above with an application focus on systems management.
– Related work: [Bahl et al 2007][Kandula et al 2008][Mahimkar et al 2009]


System-wide Similarity Queries (S2Q)

Queries: asking about the similarity of one or more objects based on their time-based states
– Nearest neighbor search (Sq, k; S), which asks for the top-k states in S that are most similar to the target state Sq.
– Range query (Sq, d; S), which asks for all the states in S that are within distance d of the target state Sq, where d is a similarity threshold.

Challenges
– Query processing efficiency & quality
  • Massive and noisy data continuously generated from many sources
– Supporting integration of domain knowledge into query processing
  • Diverse management tasks, e.g., workload management, performance management, application management, & security
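The two query primitives can be illustrated with a minimal Python sketch. It assumes that each state is a numeric tuple and that similarity is measured with Euclidean distance; both are placeholder assumptions for illustration, since S2Q itself compares states through dependency graphs. The function names follow the `most_similar` and `within` query forms used in the evaluation tasks.

```python
import heapq
import math


def euclidean(a, b):
    """Placeholder distance between two states (an assumption, not the S2Q metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def most_similar(sq, k, states, dist=euclidean):
    """Nearest neighbor search (Sq, k; S): the k states in S closest to Sq."""
    return heapq.nsmallest(k, states, key=lambda s: dist(sq, s))


def within(sq, d, states, dist=euclidean):
    """Range query (Sq, d; S): all states in S within distance d of Sq."""
    return [s for s in states if dist(sq, s) <= d]
```

Passing the distance function as a parameter keeps the primitives independent of the state model, so a graph-based metric could be swapped in without changing the query code.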


Outline

■ Problem statement
– Networked system management
– System-wide similarity queries

Solution

Evaluation


S2Q framework

System modeling
– Data information (state)

Similarity metrics
– To compare managed objects

Indexing
– For efficient retrieval

Similarity query primitives
– Nearest-neighbor query, range query

Task view interface
– To express management tasks using similarity queries
– Query plan formulation and execution


System Modeling

Monitoring data
– Multi-dimensional time-series
  • $\langle M_1^t, M_2^t, \ldots, M_n^t \rangle$, $t = 0, 1, 2, \ldots$
– System logs / events
  • Not considered in this paper.

Design space
– Raw data
  • Issues: large volume; measurement noise (reading errors, time synchronization, etc.).
– Clustering-based techniques
  • K-Means, LAC, etc.
  • Issues: hard to decide the number of clusters K; curse of dimensionality.
– Pairwise dependency relationships
  • Bottom-up methodology
  • Studied in this paper.

Time   M1    M2    …    Mn
t1     0.2   21    …    321
t2     0.3   38    …    22
…      …     …     …    …
tk     0.7   45    …    876
…      …     …     …    …


System Modeling – pair-wise dependency relationships

Hypothesis
– Dependency relationships change significantly only when systems transition from one state to another.

Dependency relationships
– Statistical dependencies
  • Statistical correlation of the time-series from a pair of system metrics, using some correlation metric.
  • Correlation metrics
    » Linear correlation
    » Covariance-matrix-structure-based correlation.
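As a small illustration of the first correlation metric, the sketch below computes linear (Pearson) correlation for every pair of metrics over one time window. The metric names, the dictionary layout, and the use of the absolute value as the dependency strength are assumptions made for this example only.

```python
import math


def pearson(x, y):
    """Pearson linear correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no linear dependency
    return cov / math.sqrt(vx * vy)


def pairwise_dependencies(metrics, start, end):
    """Map each metric pair to |correlation| over the window [start, end).

    `metrics` is a dict from metric name to a list of readings
    (a hypothetical layout chosen for this sketch).
    """
    names = sorted(metrics)
    return {
        (a, b): abs(pearson(metrics[a][start:end], metrics[b][start:end]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```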


System Modeling – covariance matrix based dependency score

Let X and Y represent two time series. The dependency score of X and Y at a time point t is computed as follows:
1. Generate the auto-covariance matrices of X and Y around time t, respectively.
   – Where $X_{i,w}$ is a time-series segment $[X_i, X_{i+1}, \ldots, X_{i+w-1}]$, and $X'_{i,w}$ is the transpose of $X_{i,w}$.
2. Compute the dependency score of X and Y based on their auto-covariance matrices.
   i. Decompose the covariance matrices using singular value decomposition (SVD).
   ii. The dependency score of X and Y is computed as the distance between the two subspaces spanned by the top-k principal components of X and Y, respectively.
       • Dependence score $= \frac{1}{2}\left(\lVert U_X^T u_Y \rVert + \lVert U_Y^T u_X \rVert\right)$
       • $U_X$, $U_Y$: the top-k principal components of $ACov_X$ and $ACov_Y$.
       • $u_X$, $u_Y$: the first principal components of $ACov_X$ and $ACov_Y$.
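The two steps can be sketched in plain Python under some stated assumptions. The auto-covariance matrix is assumed here to be the sum of outer products of the m length-w segments starting at positions t-m+1 through t; the exact definition is not given in the text above. Power iteration with Gram-Schmidt is used as a dependency-free stand-in for SVD: for a symmetric positive semi-definite matrix it yields the same top-k directions up to sign. The parameter defaults (w, m, k) are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def autocov(x, t, w, m):
    """ASSUMED form: sum of outer products of segments X_{i,w}, i = t-m+1 .. t.

    Requires t - m + 1 >= 0 and t + w <= len(x).
    """
    acov = [[0.0] * w for _ in range(w)]
    for i in range(t - m + 1, t + 1):
        seg = x[i:i + w]
        for r in range(w):
            for c in range(w):
                acov[r][c] += seg[r] * seg[c]
    return acov


def top_components(a, k, iters=200):
    """Top-k principal directions of a symmetric PSD matrix.

    Power iteration plus Gram-Schmidt, used in place of SVD.
    Assumes the matrix has rank of at least k.
    """
    n = len(a)
    comps = []
    for j in range(k):
        v = [1.0 / (i + 1 + j) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            av = [dot(row, v) for row in a]
            for u in comps:  # remove directions that were already found
                d = dot(av, u)
                av = [p - d * q for p, q in zip(av, u)]
            norm = math.sqrt(dot(av, av))
            if norm < 1e-12:
                break
            v = [p / norm for p in av]
        comps.append(v)
    return comps


def dependence_score(x, y, t, w=20, m=30, k=2):
    """0.5 * (||U_X^T u_Y|| + ||U_Y^T u_X||) at time t."""
    ux = top_components(autocov(x, t, w, m), k)
    uy = top_components(autocov(y, t, w, m), k)
    proj_y = math.sqrt(sum(dot(u, uy[0]) ** 2 for u in ux))
    proj_x = math.sqrt(sum(dot(u, ux[0]) ** 2 for u in uy))
    return 0.5 * (proj_y + proj_x)
```

With this formulation, a sinusoid and a time-shifted copy of it span the same subspace and score close to 1, while two sinusoids whose frequencies are orthogonal over the window score close to 0.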

[Figure: SVD output of an example ACovX. Panels: original time-series; top 1 pattern; top 2 pattern.]


System Modeling – system-wide similarity graph

A dependency graph Gt = (V, Et) is generated at time t
– V is the set of attributes of target system objects
– Et is the set of dependency relationships between object attributes at t, computed with the covariance-matrix-based dependency metric.

Similarity metrics on two graphs Gt1 and Gt2
– One simple metric is the sum of edge weight differences, with dependency scores used as the edge weights.
– In the evaluation, we first prune away the edges whose weights are below a threshold (e.g., 0.9), and then calculate the distance between Gt1 and Gt2 as:
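As an illustration of the simpler of the two metrics (the sum of edge weight differences), and not of the evaluation formula itself, a minimal Python sketch might look like this. Representing a graph as a dict from edge to weight, and treating a missing edge as weight 0, are assumptions made for this example.

```python
def prune(graph, threshold=0.9):
    """Drop edges whose weight is below the threshold."""
    return {edge: w for edge, w in graph.items() if w >= threshold}


def edge_weight_distance(g1, g2):
    """Sum of absolute edge weight differences between two graphs.

    An edge missing from one graph is treated as weight 0 (an assumption).
    """
    edges = set(g1) | set(g2)
    return sum(abs(g1.get(e, 0.0) - g2.get(e, 0.0)) for e in edges)
```

For example, `edge_weight_distance(prune(g1), prune(g2))` compares only the strong dependencies of two system states.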


Dependence score computation–algorithm and performance

Robustness to noise: (a) time-series (b) dependence score

Robustness to time delay: (a) time-series (b) dependence score

Streaming algorithm performance


Outline

■ Problem statement
– Networked system management
– System-wide similarity queries

Solution
– S2Q framework
– System modeling: related work and our solution
– Similarity metric and index: related work and our solution

Evaluation
– Task 1: fast diagnosis of repeated failures in IT systems
– Task 2: automated application traffic profiling


Task I: fast diagnosis of repeated failures in IT systems

Goal: reuse past diagnosis efforts by quickly locating similar, previously diagnosed failure instances.
– 50%-90% of failures are recurrences of previous failures [Brodie et al 2005]

S2Q query formulation:
– Q = most_similar(Sq, N; SU), which asks for the top-N failure states in SU that are most similar to the failure state Sq being diagnosed.

Experimental setting
– Three-tier Web service testbed
  • JBoss application server, embedded Web server, and MySQL DB
  • Runs an auction service, RUBiS, modeled on eBay
  • Data: #procedure invocations in Java beans of the application tier
– Various failures injected to simulate different system states
  • Java exceptions, deadlock, memory leak, infinite loop, etc.
  • Using the AFPI tool from the Berkeley/Stanford ROC project


Task I: results

Dataset
– 4120 * 105
– 80 distinct failure states
– 40% as test instances

Evaluation metrics
– Given a state S, return the N most similar instances

Schemes
– S2Q
– Raw-data: K-means clustering technique applied on the raw monitoring data

Precision = #matched / N

Recall = #matched / #historic_S
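The two metrics can be computed as in the short Python sketch below. It assumes that each instance carries a failure-state label, that #matched counts returned instances whose label equals the query's, and that #historic_S is the number of historical instances with that same label; this reading of #historic_S is an interpretation, not something stated above.

```python
def precision_recall(query_label, returned_labels, historic_labels):
    """Precision = #matched / N and Recall = #matched / #historic_S.

    `returned_labels` are the labels of the N returned instances;
    `historic_labels` are the labels of all historical instances.
    """
    n = len(returned_labels)
    matched = sum(1 for lab in returned_labels if lab == query_label)
    historic = sum(1 for lab in historic_labels if lab == query_label)
    precision = matched / n if n else 0.0
    recall = matched / historic if historic else 0.0
    return precision, recall
```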

[Figure: precision (y-axis) vs. recall (x-axis) for S2Q and raw-data, with annotations N=1 and N=50.]


Task II: automated application traffic profiling

Goal: automatic learning of application information from network traffic
– Hypothesis 1: ports associated with an application show similar patterns over time and across the port group
– Hypothesis 2: traffic through randomly used ports looks like a noise signal

S2Q query formulation:
– Q = within(Oq, d; O), where Oq is the state of the target traffic object, d is a similarity threshold, and O is the set of all traffic objects found in the monitoring data.

Experimental setting
– Dartmouth campus-wide wireless network traffic, in packets <SrcIP, DstIP, SrcPort, DstPort, protocol>
– Aggregate the data by <port, protocol> combination, with flow statistics in 5-min intervals
  • #packets, #bytes, & two entropy-related features on SrcIP and DstIP
– No prior knowledge about application-port mappings
– Output: a set of applications, each represented as a group of <port, protocol> combinations
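The aggregation step in the experimental setting can be sketched in Python as follows. Several details are assumptions for illustration: each packet record is extended with a timestamp in seconds and a size in bytes, the destination port is used as the port of the <port, protocol> key, and the two entropy-related features are taken to be the Shannon entropy of the source and destination IP distributions.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    return sum((c / total) * math.log2(total / c) for c in counter.values())


def aggregate(packets, interval=300):
    """Aggregate packets per (port, protocol, interval index).

    `packets` holds tuples (ts, src_ip, dst_ip, src_port, dst_port, proto, size);
    ts and size are hypothetical fields added for this sketch.
    Returns a dict: key -> (#packets, #bytes, SrcIP entropy, DstIP entropy).
    """
    buckets = defaultdict(
        lambda: {"packets": 0, "bytes": 0, "src": Counter(), "dst": Counter()}
    )
    for ts, src_ip, dst_ip, _src_port, dst_port, proto, size in packets:
        b = buckets[(dst_port, proto, int(ts // interval))]
        b["packets"] += 1
        b["bytes"] += size
        b["src"][src_ip] += 1
        b["dst"][dst_ip] += 1
    return {
        key: (b["packets"], b["bytes"], entropy(b["src"]), entropy(b["dst"]))
        for key, b in buckets.items()
    }
```

Each <port, protocol> pair then yields one feature vector per 5-minute interval, which forms the multi-dimensional time-series that the similarity queries operate on.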


Task II: results (I)

<port, protocol> = <137, UDP> <139, TCP>

[Figure: for each of the two pairs, <137, UDP> (NetBIOS Name Service) and <139, TCP> (NetBIOS Session Service), the original time-series followed by its top 1, top 2, and top 3 patterns.]


Task II: results (II)

Data: one-day traffic trace at one sniffing point

Out of ~8000 <port, protocol> combinations in the trace, 15 applications were identified

Profiling results: port number (application)
– 80 (HTTP), 53 (DNS), 137-139 (NetBIOS), 1214 (Kazaa), 5190 (AOL Messenger), 161 (SNMP), 0 (ICMP), 67-68 (DHCP), 1071 (BSQUARE-VOIP), 6699 (WinMX)
– 6 major applications were verified by the data owner [Kotz2002]


Conclusions & Future Work

Main results
– A framework for System-wide Similarity Queries is described.
  • The framework is general across target systems and systems management tasks.
– A robust system modeling technique based on covariance matrix structures is proposed to characterize the dependency between multiple time-series.
  • A graph-based system-wide similarity metric
  • A streaming algorithm for similarity score computation.
– Two systems management applications are evaluated by applying the proposed S2Q methodology.

Future work
– Extension and optimization for event & symbolic data.
– Implementation in the MapReduce distributed computation framework.
– Evaluation on other management tasks, e.g., network security.
