Supporting System-Wide Similarity Queries for
Networked System Management
Songyun Duan (1), Hui Zhang (2), Guofei Jiang (2), Xiaoqiao Meng (1)
1. IBM T.J. Watson Research Center, Hawthorne, NY
2. NEC Laboratories America, Princeton, NJ, USA
www.nec-labs.com
2
Outline
■ Problem statement
Solution
Evaluation
3
Problem statement
Motivation
– Large networked systems are the backbone of modern IT and Internet services.
– System dynamics and complexity lead to management difficulties when unexpected events happen.
– System administrators are overwhelmed with data in systems management.
• Massive monitoring data from extensive instrumentation.
4
Problem statement
Goal
– Data analysis tools based on a simple and powerful query primitive for systems management
– Makes information intuitive to data users (e.g., sysadmins)
– Observation: similarity queries arise in various management tasks
• Performance management
» Given a performance problem in time period T, has the system experienced a similar problem in the past, when did it occur, and was a successful diagnosis reported for it?
• Traffic management
» What are the top-k <port, protocol> pairs that exhibit the most similar traffic patterns at an hourly time scale?
• Network security
• Workload management
5
Background
Preliminaries
– For managing massive logs and/or monitoring data, existing systems support data navigation/browsing with visualization or SQL-like queries
Related Work
– AT&T: telecom data set visualization tools [keim1999]
– UC Berkeley TelegraphCQ: continuous dataflow processing for an uncertain world [Chandrasekaran2003]
– Yahoo Pig: web-scale log processing [olston2008]
– Amazon AWS: GrepTheWeb, Hadoop on AWS [varia2008]
– Facebook Hive: data warehousing using Hadoop [sarma2008]
Our work complements the above with an application focus on systems management.
– Related works: [Bahl et al 2007][Kandula et al 2008][Mahimkar et al 2009]
6
System-wide Similarity Queries (S2Q)
Queries: asking for the similarity of one or more objects based on their time-based states
– Nearest neighbor search (Sq, k; S), which asks for the top-k states in S that are most similar to the target state Sq.
– Range query (Sq, d; S), which asks for all the states in S that are within distance d of the target state Sq, where d is a similarity threshold.
Challenges
– Query processing efficiency & quality
• Massive and noisy data continuously generated from many sources
– Supporting integration of domain knowledge into query processing
• Diverse management tasks, e.g., workload management, performance management, application management, & security
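The two query primitives above can be sketched as a brute-force scan. This is a minimal illustration only: the function names, the representation of a state as a numeric vector, and the use of Euclidean distance are assumptions made here, not the S2Q implementation (which uses a graph-based metric and an index).

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple

State = Sequence[float]  # assumption: a state is a plain numeric vector


def euclidean(a: State, b: State) -> float:
    """Placeholder distance; S2Q itself uses a graph-based similarity metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(sq: State, k: int, states: Dict[str, State],
                      dist: Callable[[State, State], float] = euclidean
                      ) -> List[Tuple[float, str]]:
    """(Sq, k; S): the k states in S closest to Sq, nearest first."""
    return heapq.nsmallest(k, ((dist(sq, s), name) for name, s in states.items()))


def range_query(sq: State, d: float, states: Dict[str, State],
                dist: Callable[[State, State], float] = euclidean
                ) -> List[Tuple[float, str]]:
    """(Sq, d; S): all states in S within distance d of Sq, nearest first."""
    scored = ((dist(sq, s), name) for name, s in states.items())
    return sorted(pair for pair in scored if pair[0] <= d)
```

A linear scan is fine for a handful of states; the indexing component of the framework exists precisely because this does not scale to massive monitoring data.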
7
Outline
■ Problem statement
– Networked system management
– System-wide similarity queries
Solution
Evaluation
8
S2Q framework
System modeling
– Data information (state)
Similarity metrics
– To compare managed objects
Indexing
– For efficient retrieval
Similarity query primitives
– Nearest-neighbor query, range query
Task view interface
– To express mgmt. tasks using similarity queries
– Query plan formulation and execution
9
System Modeling
Monitoring data
– Multi-dimensional time-series
• $\langle M_1^t, M_2^t, \ldots, M_n^t \rangle$, $t = 0, 1, 2, \ldots$
– System logs / events
• Not considered in this paper.
Design space
– Raw data
• Issues: large volume; measurement noise (reading errors, time synchronization, etc.).
– Clustering-based techniques
• K-Means, LAC, etc.
• Issues: hard to decide the number K; curse of dimensionality.
– Pairwise-dependency relationships
• Bottom-up methodology
• Studied in this paper.
Time   M1    M2    …    Mn
t1     0.2   21    …    321
t2     0.3   38    …    22
…      …     …     …    …
tk     0.7   45    …    876
…      …     …     …    …
10
System Modeling – pair-wise dependency relationships
Hypotheses
– Dependency relationships change significantly only when systems transition from one state to another.
Dependency relationships
– Statistical dependencies
• Statistical correlation of time-series from a pair of system metrics using some correlation metric.
• Correlation metrics
» Linear correlation
» Covariance matrix structure based correlation.
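As an illustration of the first correlation metric listed above, the sketch below scores a pair of metrics with the absolute Pearson correlation over a sliding window. The window handling, the use of the absolute value, and the function names are assumptions for illustration; the slides do not specify how linear correlation is applied.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def windowed_dependency(x: Sequence[float], y: Sequence[float], w: int) -> List[float]:
    """|r| of x and y over every length-w window; entry i covers samples i .. i+w-1."""
    return [abs(pearson(x[i:i + w], y[i:i + w])) for i in range(len(x) - w + 1)]
```

Under the hypothesis above, a sharp change in this per-window score for many metric pairs at once would suggest a state transition.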
11
System Modeling – covariance matrix based dependency score
Let X and Y represent two time series. The dependency score of X and Y at a time point t is computed as follows:
1. Generate the auto-covariance matrices of X and Y around time t, respectively.
– Where $X_{i,w}$ is a time series segment $[X_i, X_{i+1}, \ldots, X_{i+w-1}]$, and $X'_{i,w}$ is the transpose of $X_{i,w}$.
2. Compute the dependency score of X and Y based on their auto-covariance matrices.
i. Decompose the covariance matrices using singular value decomposition (SVD).
ii. The dependency score of X and Y is computed as the distance between the two subspaces spanned by the top-k principal components of X and Y, respectively.
• Dependence score = $\frac{1}{2}\left(\lVert U_X^T u_Y \rVert + \lVert U_Y^T u_X \rVert\right)$
• $U_X$, $U_Y$: top-k principal components of $ACov_X$ and $ACov_Y$.
• $u_X$, $u_Y$: the first principal components of $ACov_X$ and $ACov_Y$.
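The two steps above can be sketched with NumPy. The exact auto-covariance formula is not preserved in this transcript, so the form used here (a sum of outer products of m mean-centered, length-w segments ending around t) is an assumption, as are the function names and the default values of w, m, and k. Only the final score formula is taken from the slide.

```python
import numpy as np


def auto_covariance(x, t, w, m):
    """ASSUMED form: sum of outer products of the m length-w segments
    starting at t-m+1 .. t, after mean-centering that span of the series."""
    if t - m + 1 < 0 or t + w > len(x):
        raise ValueError("window around t falls outside the series")
    span = np.asarray(x[t - m + 1:t + w], dtype=float)
    span = span - span.mean()
    acov = np.zeros((w, w))
    for i in range(m):
        seg = span[i:i + w]
        acov += np.outer(seg, seg)
    return acov


def top_components(acov, k):
    """Top-k left singular vectors (principal components) of ACov."""
    u, _, _ = np.linalg.svd(acov)  # singular values come back in descending order
    return u[:, :k]


def dependence_score(x, y, t, w=8, m=16, k=2):
    """0.5 * (||U_X^T u_Y|| + ||U_Y^T u_X||), as given on the slide."""
    U_x = top_components(auto_covariance(x, t, w, m), k)
    U_y = top_components(auto_covariance(y, t, w, m), k)
    u_x, u_y = U_x[:, 0], U_y[:, 0]
    return 0.5 * (np.linalg.norm(U_x.T @ u_y) + np.linalg.norm(U_y.T @ u_x))
```

Because the columns of U_X and U_Y are orthonormal and u_X, u_Y are unit vectors, the score lies in [0, 1]; under these assumptions a series and a scaled, shifted copy of it score close to 1.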
[Figure: SVD output of an example $ACov_X$. Panels: original time-series, top-1 pattern, top-2 pattern.]
12
System Modeling – system-wide similarity graph
A dependency graph Gt = (V, Et) is generated at time t
– V is the set of attributes of the target system objects
– Et is the set of dependency relationships between object attributes at t, computed with the covariance-matrix-based dependency metric.
Similarity metrics on two graphs Gt1 and Gt2
– One simple metric is the sum of edge weight differences, if dependency scores are used as the edge weights.
– In the evaluation, we first prune away the edges whose weights are below a threshold (e.g., 0.9), and then calculate the distance between Gt1 and Gt2 as:
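The distance formula used in the evaluation is not preserved in this transcript. The sketch below therefore illustrates only the "simple metric" (sum of edge weight differences), optionally applied after threshold pruning. Representing a graph as a dict from an undirected edge to its weight, and treating a missing edge as weight 0.0, are assumptions made for this example.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]
Graph = Dict[Edge, float]  # assumption: undirected edge -> dependency score


def edge(a: str, b: str) -> Edge:
    """Canonical key for an undirected edge between attributes a and b."""
    return (a, b) if a <= b else (b, a)


def prune(graph: Graph, threshold: float = 0.9) -> Graph:
    """Drop edges whose weight is below the threshold."""
    return {e: w for e, w in graph.items() if w >= threshold}


def edge_weight_distance(g1: Graph, g2: Graph) -> float:
    """Sum of absolute edge weight differences; a missing edge counts as 0.0."""
    all_edges = set(g1) | set(g2)
    return sum(abs(g1.get(e, 0.0) - g2.get(e, 0.0)) for e in all_edges)
```

For example, `edge_weight_distance(prune(g1), prune(g2))` compares only the strong dependencies of two states, so an edge that is strong in one state and pruned in the other contributes its full weight.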
13
Dependence score computation–algorithm and performance
Robustness to noise: (a) time-series (b) dependence score
Robustness to time delay: (a) time-series (b) dependence score
Streaming algorithm performance
14
Outline
■ Problem statement
– Networked system management
– System-wide similarity queries
Solution
– S2Q framework
– System modeling: related work and our solution
– Similarity metric and index: related work and our solution
Evaluation
– Task 1: fast diagnosis of repeated failures in IT systems
– Task 2: automated application traffic profiling
15
Task I: fast diagnosis of repeated failures in IT systems
Goal: reuse past diagnosis efforts by quickly locating similar, previously diagnosed failure instances.
– 50%-90% of failures are recurrences of previous failures [Brodie et al 2005]
S2Q query formulation:
– Q = most_similar(Sq, N; SU), which asks for the top-N failure states in SU that are most similar to the failure state Sq to be diagnosed.
Experimental setting
– Three-tier Web service testbed
• JBoss application server, embedded Web server, and MySQL DB
• Runs an auction service, RUBiS, modeled on eBay
• Data: #procedure invocations in Java beans of the application tier
– Various failures injected to simulate different system states
• Java exceptions, deadlock, memory leak, infinite loop, etc.
• Using the AFPI tool from the Berkeley/Stanford ROC project
16
Task I: results
Dataset
– 4120 * 105
– 80 distinct failure states
– 40% as test instances
Evaluation metrics
– Given a state S, return the N most similar instances
– Precision = #matched / N
– Recall = #matched / #historic_S
Schemes
– S2Q
– Raw-data: K-means clustering technique applied on the raw monitoring data
[Figure: precision (y-axis) vs. recall (x-axis) for S2Q and raw-data, with points ranging from N=1 to N=50.]
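The two evaluation metrics can be computed as in the sketch below. It assumes that each returned instance carries a failure-state label, that "#matched" counts returned instances whose label equals that of the query state S, and that "#historic_S" is the number of historic instances of that same failure state. The function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(returned_labels: Sequence[str], query_label: str,
                     n_historic: int) -> Tuple[float, float]:
    """Precision = #matched / N and Recall = #matched / #historic_S."""
    n = len(returned_labels)
    if n == 0 or n_historic == 0:
        raise ValueError("need at least one returned and one historic instance")
    matched = sum(1 for label in returned_labels if label == query_label)
    return matched / n, matched / n_historic
```

With N=4 returned instances, 3 of which match, and 6 historic instances of the same failure, this gives a precision of 0.75 and a recall of 0.5.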
17
Task II: automated application traffic profiling
Goal: automatic learning of application information from network traffic
– Hypothesis 1: ports associated with an application show similar patterns over time and across the port group
– Hypothesis 2: traffic through randomly used ports resembles a noise signal
S2Q query formulation:
– Q = within(Oq, d; O), where Oq is the state of the target traffic object, d is a similarity threshold, and O is the set of all traffic objects found in the monitoring data.
Experimental setting
– Dartmouth campus-wide wireless network traffic, in packets <SrcIP, DstIP, SrcPort, DstPort, protocol>
– Aggregate the data based on <port, protocol> combinations, with flow statistics in 5-min intervals
• #packets, #bytes, & two entropy-related features on SrcIP and DstIP
– No prior knowledge about application-port mappings
– Output: a set of applications; each is represented as a group of <port, protocol> combinations
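The aggregation step in the experimental setting can be sketched as follows. Several details are assumptions for illustration only: packets are given as tuples that also carry a timestamp in seconds and a byte count, the destination port is used as the port in <port, protocol>, and the two entropy-related features are taken to be the Shannon entropy of the source and destination IP distributions.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of the empirical distribution given by counts."""
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def aggregate(packets, interval_s=300):
    """Group packets by (port, protocol, 5-min bin) and compute flow statistics.

    packets: iterable of (ts, src_ip, dst_ip, src_port, dst_port, protocol, n_bytes);
    ts and n_bytes are assumed fields, and dst_port is the assumed grouping port.
    """
    groups = defaultdict(lambda: {"packets": 0, "bytes": 0,
                                  "src": Counter(), "dst": Counter()})
    for ts, src_ip, dst_ip, _src_port, dst_port, protocol, n_bytes in packets:
        g = groups[(dst_port, protocol, int(ts // interval_s))]
        g["packets"] += 1
        g["bytes"] += n_bytes
        g["src"][src_ip] += 1
        g["dst"][dst_ip] += 1
    return {key: {"packets": g["packets"],
                  "bytes": g["bytes"],
                  "src_ip_entropy": shannon_entropy(g["src"]),
                  "dst_ip_entropy": shannon_entropy(g["dst"])}
            for key, g in groups.items()}
```

Each <port, protocol> pair then yields a four-dimensional time series (one sample per bin), which is the kind of input the within(Oq, d; O) query compares.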
18
Task II: results (I)
<port, protocol> = <137, UDP> <139, TCP>
[Figure: two panels, one per <port, protocol> pair: <137, UDP> (NetBIOS Name Service) and <139, TCP> (NetBIOS Session Service). Each panel shows the original time-series and its top-1, top-2, and top-3 patterns.]
20
Task II: results (II)
Data: one-day traffic trace at one sniffing point
Out of ~8000 <port, protocol> combinations in the trace, 15 applications were identified
Profiling results: port number (application)
– 80 (HTTP), 53 (DNS), 137-139 (NetBIOS), 1214 (Kazaa), 5190 (AOL Messenger), 161 (SNMP), 0 (ICMP), 67-68 (DHCP), 1071 (BSQUARE-VOIP), 6699 (WinMX)
– 6 major applications were verified by the data owner [Kotz2002]
21
Conclusions & Future Work
Main results
– A framework for System-wide Similarity Queries is described.
• The framework is general for various target systems and systems management tasks.
– A robust system modeling technique based on covariance matrix structures is proposed to characterize dependency between multiple time-series.
• A graph-based system-wide similarity metric
• A streaming algorithm for similarity score computation.
– Two systems management applications are evaluated by applying the proposed S2Q methodology.
Future work
– Extension and optimization on event & symbolic data.
– Implementation in the MapReduce distributed computation framework.
– Evaluate other management tasks, e.g., network security.