
Page 1: Machine Learning for Automated Diagnosis of Distributed Systems Performance

© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Machine Learning for Automated Diagnosis of Distributed Systems Performance

Ira Cohen
HP-Labs
June 2006
http://www.hpl.hp.com/personal/Ira_Cohen

Page 2: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Intersection of systems and ML/data mining: a growing (research) area

• Berkeley's RAD Lab (Reliable Adaptive Distributed systems lab) received $7.5M from Google, Microsoft and Sun for: "…adoption of automated analysis techniques from Statistical Machine Learning (SML), control theory, and machine learning, to radically improve detection speed and quality in distributed systems"
• Workshops devoted to the area (e.g., SysML), papers in leading systems and data mining conferences
• Part of IBM's "Autonomic Computing" and HP's "Adaptive Enterprise" visions
• Startups (e.g., Splunk, LogLogic)
• And more…

Page 3: Machine Learning for Automated Diagnosis of Distributed Systems Performance

SLIC project at HP-Labs*: Statistical Learning, Inference and Control

• Research objective: provide technology enabling automated decision making, management and control of complex IT systems.
− Explore statistical learning, decision theory and machine learning as the basis for automation.

*Participants/Collaborators: Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox, Steve Zhang, Jeff Chase, Rob Powers, Chengdu Huang, Blaine Nelson

Today I'll focus on performance diagnosis.

Page 4: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Intuition: Why is performance diagnosis hard?

• What do you do when your PC is slow?

Page 5: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Why care about performance?

• Answer: It costs companies BIG money. Analysts estimate that poor application performance costs U.S.-based companies approximately $27 billion each year.
• Performance management software products' revenue is growing at a double-digit percentage every year!

Page 6: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Challenges today in diagnosing/forecasting IT performance problems

• Distributed systems/services are complex
− Thousands of systems/services/applications are typical
− Multiple levels of abstraction and interactions between components
− Systems/applications change rapidly
• Multiple levels of responsibility (infrastructure operators, application operators, DBAs, …) --> a lot of finger pointing
− Problems can take days/weeks to resolve
• Loads of data, no actionable information
− Operators manually search for a needle in a haystack
− Multiple types of data sources --- lack of unifying tools to even view the data
• Operators hold past diagnosis efforts in their heads - the history of diagnosis efforts is mostly lost.

Page 7: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Translation to Machine Learning Challenges

• Transforming data to information: classification and feature selection methods - with a need for explanation
• Adaptation: learning with concept drift
• Leveraging history: transforming diagnosis into an information retrieval problem, clustering methods, etc.
• Using multiple data sources: combining structured and semi-structured data
• Scalable machine learning solutions: distributed analysis, transfer learning
• Using human feedback (human in the loop): semi-supervised learning (active learning, semi-supervised clustering)

Page 8: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Outline

• Motivation (already behind us…)
• Concrete example: the state of distributed performance management today
• ML challenges
− Examples of research results
• Bringing it all together as a tool: providing diagnostic capabilities as a centrally managed service
• Discussion/Summary

Page 9: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Example: A real distributed HP application architecture

Geographically distributed 3-tier application

Results shown today are from the last 19+ months of data collected from this service.

Page 10: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Application performance "management": Service Level Objectives (SLOs)

Unhealthy = SLO violation

Page 11: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Detection is not enough…

• Leverage history:
− Did we see similar problems in the past?
− What were the repair actions?
− Do/did they occur in other data centers?
• Triage:
− What are the symptoms of the problem?
− Who do I call?
• Can we forecast these problems?
• Problem prioritization:
− How many different problems are there, and what is their severity?
− Which are recurrent?

Page 12: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Challenge 1: Transforming data to information…

• Many measurements (metrics) are available on IT systems (OpenView, Tivoli, etc.)
− System/application metrics: CPU, memory, disk, network utilizations, queues, etc.
− Measured on a regular basis (1-5 minutes with commercial tools)
• Other semi-structured data (log files)

Where is the relevant information?

Page 13: Machine Learning for Automated Diagnosis of Distributed Systems Performance

ML Approach: Model using Classifiers

Leverage all the data collected in the infrastructure to:

1) Use classifiers: F(M) -> SLO state
2) Use classification accuracy as the measure of success
3) Use feature selection to find the most predictive metrics of the SLO state (see the sketch below)
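As a minimal sketch of these three steps (the slides do not name a library; scikit-learn, a Gaussian naive Bayes stand-in for the Bayesian network classifiers discussed later, and the data layout are all assumptions):

```python
# Hypothetical sketch: classify the SLO state from system metrics and
# select the most predictive metrics. Library choice (scikit-learn)
# and data layout are assumptions, not from the original slides.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def diagnose(metrics: np.ndarray, slo_state: np.ndarray, metric_names):
    """metrics: (n_samples, n_metrics); slo_state: 0=compliant, 1=violation."""
    X_tr, X_te, y_tr, y_te = train_test_split(metrics, slo_state, test_size=0.3)

    # Feature selection: keep the handful of metrics most informative
    # about the SLO state (the slides report 3-10 usually suffice).
    selector = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)
    chosen = [metric_names[i] for i in selector.get_support(indices=True)]

    # F(M) -> SLO state; classification accuracy measures success.
    clf = GaussianNB().fit(selector.transform(X_tr), y_tr)
    acc = accuracy_score(y_te, clf.predict(selector.transform(X_te)))
    return clf, chosen, acc
```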

Page 14: Machine Learning for Automated Diagnosis of Distributed Systems Performance

But we need an explanation, not just classification accuracy...

• Normal: the metric has a value associated with healthy behavior
• Abnormal: the metric has a value associated with unhealthy behavior

Inferences ("metric attribution"): P(M|SLO)

Our approach: learn the joint probability distribution (Bayesian network classifiers)
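To show concretely what "metric attribution" means, here is a hedged sketch under a simplifying assumption of per-metric Gaussian class-conditionals (the slides use full Bayesian network classifiers; this factored form is illustrative only):

```python
# Sketch of metric attribution: a metric is flagged "abnormal" when its
# observed value is more likely under the violation-conditional model
# than under the compliant-conditional one. Per-metric Gaussians are a
# simplifying assumption; the slides use Bayesian network classifiers.
from scipy.stats import norm

def attribute_metrics(sample, models):
    """models[name] = ((mu_ok, sd_ok), (mu_bad, sd_bad)), fit per metric."""
    abnormal = []
    for name, value in sample.items():
        (mu_ok, sd_ok), (mu_bad, sd_bad) = models[name]
        # Compare class-conditional likelihoods P(m | SLO state).
        if norm.logpdf(value, mu_bad, sd_bad) > norm.logpdf(value, mu_ok, sd_ok):
            abnormal.append(name)
    return abnormal
```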

Page 15: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Bayesian network classifiers: Results

• "Fast" (in the context of 1-5 min data collection):
− Models take 2-10 seconds to train on days' worth of data
− Metric attribution takes 1ms-10ms to compute
• Found that on the order of 3-10 metrics (out of hundreds) are needed to accurately capture a performance problem
• Accuracy is high (~90%)*
• Experiments showed the metrics are useful for diagnosing certain problems on real systems
• It is hard to capture multiple types of performance problems with a single model!

[Figure: Bayesian network structure linking the SLO state to metrics M3, M30, M32, M5, M8]

Page 16: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Additional issues

• How much data is needed to get accurate models?
• How to detect model validity?
• How to present models/results to operators?

Page 17: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Challenge 2: Adaptation

• Systems and applications change
• Reasons for performance problems change over time (and sometimes recur)

A different problem? The same problem?

Learning with "concept drift"

Page 18: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Adaptation: Possible approaches

• Single omniscient model: "train once, use forever"
− Assumes the training data provides all the information
• Online updating of the model
− E.g., parameter/structure updating of Bayesian networks, online learning of neural networks, support vector machines, etc.
− Potentially wasteful retraining when similar problems recur
• Maintain an ensemble of models
− Requires criteria for choosing the subset of models used in inference
− Criteria for adding new models to the ensemble
− Criteria for removing models from the ensemble

Page 19: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Our approach: Managing an ensemble of models for our classification approach

Construction:
1. Periodically induce a new model
2. Check whether the model adds new information (classification accuracy)
3. Update the ensemble of models

Inference: use the Brier score for the selection of models (a sketch follows below).
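A minimal sketch of that inference rule, winner-takes-all by Brier score, which the results slide reports worked better than accuracy, likelihood, or majority voting; the recent-window evaluation and the predict_proba interface are assumptions:

```python
# Sketch: pick the ensemble member with the best (lowest) Brier score
# on recent samples, then let it alone classify the new sample
# ("winner takes all"). Window choice and the probabilistic-classifier
# interface are assumptions for illustration.
import numpy as np

def brier_score(model, X_recent, y_recent):
    """Mean squared error between predicted P(violation) and the outcome."""
    p = model.predict_proba(X_recent)[:, 1]
    return np.mean((p - y_recent) ** 2)

def classify_with_ensemble(ensemble, x_new, X_recent, y_recent):
    winner = min(ensemble, key=lambda m: brier_score(m, X_recent, y_recent))
    return winner.predict(x_new.reshape(1, -1))[0]
```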

Page 20: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Adaptation: Results

• ~7500 samples, 5 mins/sample (one month), ~70 metrics
• Classifying a sample with the ensemble of BNCs:
− Used the model with the best Brier score to predict the class (winner takes all)
• The Brier score was better than other selection measures (e.g., accuracy, likelihood)
• Winner-takes-all was more accurate than other combination approaches (e.g., majority voting)

Approach                                                Accuracy (%)   Total Processing Time (mins)
Single model: no adaptation                             61.4           0.2
Single model trained with all history (no forgetting)   82.4           71.5
Single model with sliding window                        84.2           0.9
Ensemble of models                                      90.7           7.1

Page 21: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Adaptation: Results

• The "single adaptive" model is slower to adapt to recurrent issues
− It must re-learn the behavior instead of just selecting a previous model

Page 22: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Additional issues

• Need criteria for "aging" models
• Periods of "good" behavior also change: need robustness to those changes as well.

Page 23: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Challenge 3: Leveraging history

• It would be great to have the following system:

Diagnosis: Stuck thread due to insufficient database connections
Repair: Increase connections to +6
Periods: ::::
Severity: SLO time increases up to 10 secs
Location: Americas. Not seen in Asia/Pacific

Page 24: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Leveraging history

• Main challenge: find a representation (a signature) that captures the main characteristics of the system behavior and is:
− Amenable to distance metrics
− Generated automatically
− In machine-readable form

Page 25: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Our approach to defining signatures

1) Learn probabilistic classifiers P(SLO, M)
2) Inferences: metric attribution
3) Define the attributed metrics as the signatures of the problems

Example of abnormal metrics flagged by attribution: app cpu util, app alive proc high, app active proc high, DB cpu util high

Page 26: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Example: Defining a signature

• For a given SLO violation, the models provide a list of metrics that are attributed to the violation.
• A metric has value 1 if it is attributed to the violation, -1 if it is not attributed, and 0 if it is not relevant (a sketch follows below).
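A small sketch of turning attribution output into such a signature vector (the metric universe and the input sets below are illustrative assumptions):

```python
# Sketch: encode a violation epoch as a signature over the full metric
# universe: 1 = attributed to the violation, -1 = relevant but not
# attributed, 0 = not relevant. The three input collections are
# assumptions for illustration.
import numpy as np

def make_signature(all_metrics, relevant, attributed):
    sig = np.zeros(len(all_metrics), dtype=int)
    for i, name in enumerate(all_metrics):
        if name in attributed:
            sig[i] = 1
        elif name in relevant:
            sig[i] = -1   # considered by the models, but not attributed
    return sig
```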

Page 27: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Results: With signatures…

• We were able to accurately retrieve past occurrences of similar performance problems, together with their diagnosis efforts
• ML technique: information retrieval
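Retrieval over signatures can be as simple as nearest neighbors under a vector distance; this hedged sketch assumes L2 distance and an annotated signature database (the slides only require that signatures be "amenable to distance metrics"):

```python
# Sketch: retrieve the k past problems whose signatures are closest to
# the current one. L2 distance and the (signature, annotation) store
# are assumptions for illustration.
import numpy as np

def retrieve_similar(query_sig, signature_db, k=5):
    """signature_db: list of (signature_vector, annotation_dict)."""
    dists = [np.linalg.norm(query_sig - sig) for sig, _ in signature_db]
    order = np.argsort(dists)[:k]
    return [signature_db[i][1] for i in order]
```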

Page 28: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Results: Retrieval accuracy

[Figure: precision-recall curves for retrieval of the "Stuck Thread" problem, with an ideal P-R curve shown for reference; "Top 100: 92 vs 51".]

Page 29: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Results: With signatures we can also…

• Automatically identify groups of different problems and their severity
• Identify which are recurrent
• ML technique: clustering (a sketch follows below)
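A hedged sketch of that clustering step; k-means over signature vectors is one plausible choice, and the slides do not name the algorithm or the number of clusters:

```python
# Sketch: group violation signatures into problem clusters and count
# occurrences per cluster to spot recurrent problems. KMeans and the
# choice of n_clusters are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

def group_problems(signatures, n_clusters=5):
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(signatures)
    counts = np.bincount(labels, minlength=n_clusters)
    recurrent = [c for c in range(n_clusters) if counts[c] > 1]
    return labels, recurrent
```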

Page 30: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Additional issues

• Can we generalize and abstract signatures for different systems/applications?
• How to incorporate human feedback for retrieval and clustering?
− Semi-supervised learning: results not shown today

Page 31: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Challenge 4: Combining multiple data sources

• We have a lot of semi-structured text logs, e.g.:
− Problem tickets
− Event/error logs (application/system/security/network…)
− Other logs (e.g., operators' actions)
• Logs can help obtain more accurate diagnoses and models - sometimes system/application metrics are not enough
• Challenges:
− Transforming logs to "features": information extraction
− Doing it efficiently!

Page 32: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Properties of logs

• Log events have relatively short text messages
• Much of the diversity in messages comes from different "parameters" - dates, machine/component names. The core is less unique compared to free text.
• The number of events can be huge (e.g., >100 million events per day for large IT systems)

Processing events needs to compress the logs significantly while doing it efficiently!

Page 33: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Our approach: Processing application error logs

• Significant reduction of messages: 200,000 -> 190
• Accurate
− Clustering results were validated with a hierarchical tree clustering algorithm

[Example raw log excerpt: a run of WebLogic application error entries such as "2006-02-26T00:00:06.461 ES_Domain:ES_hpat615_01:2257913:Thread43.ES82|commandchain.BaseErrorHandler.logException()|FUNCTIONAL|0||FatalException occurred … KnightIOException, message=Connection timed out …" along with FATAL "KNIGHT system unavailable: java.io.IOException" messages.]

Over 4,000,000 error log entries; 200,000+ distinct error messages.
Similarity-based sequential clustering reduces them to 190 "feature messages".
Use the count of appearances of the feature messages over 5-minute intervals as the metrics for learning (a sketch follows below).
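A hedged sketch of that pipeline: mask volatile parameters (timestamps, numeric IDs), sequentially assign each message to the first sufficiently similar cluster representative, and count cluster hits per 5-minute bucket. The tokenizer, the Jaccard similarity measure and the 0.7 threshold are illustrative assumptions, not the deck's actual algorithm details:

```python
# Sketch of similarity-based sequential clustering of log messages into
# "feature messages", then counting appearances per 5-minute interval.
# Regexes, Jaccard similarity and the threshold are assumptions.
import re
from collections import defaultdict

def tokenize(msg):
    msg = re.sub(r"\d[\d:.T-]*", "<num>", msg)      # mask dates/numbers
    return set(re.split(r"[\s|,()=:]+", msg)) - {""}

def jaccard(a, b):
    return len(a & b) / max(len(a | b), 1)

def cluster_and_count(events, threshold=0.7):
    """events: list of (timestamp_seconds, message). Returns cluster
    representatives and counts keyed by (5-minute bucket, cluster id)."""
    reps, counts = [], defaultdict(int)
    for ts, msg in events:
        toks = tokenize(msg)
        cid = next((i for i, r in enumerate(reps)
                    if jaccard(toks, r) >= threshold), None)
        if cid is None:                 # new feature message
            reps.append(toks)
            cid = len(reps) - 1
        counts[(int(ts // 300), cid)] += 1
    return reps, counts
```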

Page 34: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Learning Probabilistic Models

• Construct probabilistic models of the log-based metrics using a "hybrid-gamma distribution" (a Gamma distribution with zeros)

[Figure: PDF of the hybrid-gamma distribution as a function of the number of appearances]
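A small sketch of such a zero-inflated Gamma ("hybrid-gamma") density, assuming scipy; the exact parameterization and fitting procedure are assumptions consistent with the slide's one-line description:

```python
# Sketch: "hybrid-gamma" = a point mass at zero (intervals with no
# occurrences of a feature message) mixed with a Gamma density for
# positive counts. Parameterization and MLE fitting are assumptions.
import numpy as np
from scipy import stats

def fit_hybrid_gamma(counts):
    counts = np.asarray(counts, dtype=float)
    p_zero = np.mean(counts == 0)
    pos = counts[counts > 0]            # assumes at least one positive count
    shape, _, scale = stats.gamma.fit(pos, floc=0)  # MLE, location fixed at 0
    return p_zero, shape, scale

def hybrid_gamma_pdf(x, p_zero, shape, scale):
    """Density for x > 0; P(X = 0) = p_zero is a discrete atom."""
    return (1 - p_zero) * stats.gamma.pdf(x, shape, scale=scale)
```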

Page 35: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Results: Adding log-based metrics

• Signatures using error-log metrics pointed to the right causes in 4 out of 5 "High"-severity incidents in the past 2 months
− System metrics were not related to the problems in these cases

From the operator incident report:
Diagnosis and solution: Unable to start SWAT wrapper. Disk usage reached 100%. Cleaned up disk and restarted the wrapper…

From the application error log:
CORBA access failure: IDL:hpsewrapper/SystemNotAvailableException:…com.hp.es.wrapper.corba.hpsewrapper.SystemNotAvailableException

Page 36: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Additional issues

• With multiple instances of an application - how to do joint, efficient processing of the logs?
• Treating events as sequences in time could lead to better accuracy and compression.

Page 37: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Challenge 5: Scaling up machine learning techniques

• Large-scale distributed applications have various levels of dependencies
− Multiple instances of components
− Shared resources (DB, network, software components)
− Thousands to millions of metrics (features)

[Figure: dependent components A, B, C, D, E]

Page 38: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Challenge 5: Possible approaches

• Scalable approach: ignore dependencies between components
− Putting our head in the sand?
− See Werner Vogels' (Amazon's CTO) thoughts on it…
• Centralized approach: use all available data together for building models
− Not scalable
• A different approach: transfer models, not metrics
− Good for components that are similar and/or have similar measurements

Page 39: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Example: Diagnosis with Multiple Instances

• Method 1: diagnosing multiple instances by sharing measurement data (metrics)

[Figure: instances A and B exchanging metrics]

Page 40: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Diagnosis with Multiple Instances

• Method 1: diagnosing multiple instances by sharing measurement data (metrics)

[Figure: instances A through H exchanging metrics]

Page 41: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Diagnosis with Multiple Instances

• Method 2: diagnosing multiple instances by sharing learning experience (models)
− A form of transfer learning

[Figure: instances A and B exchanging models]

Page 42: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Diagnosis with Multiple Instances

• Method 2: diagnosing multiple instances by sharing learning experience (models)

[Figure: instances A through H exchanging models]

Page 43: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Metric Exchange: Does it help?

• Building models based on the metrics of other instances
• Observation: metric exchange does not improve model performance for load-balanced instances

[Figure: online prediction over time epochs for Instance 1 and Instance 2; the legend marks violation detection with and without exchange, and false alarms.]

Page 44: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Model Exchange: Does it help?

• Apply models trained on other instances
• Observation 1: model exchange enables quicker recognition of previously unseen problem types
• Observation 2: model exchange reduces model training cost

[Figure: online prediction over time epochs; violation detection and false alarms with vs. without model exchange. Models imported from other instances improve accuracy.]
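A hedged sketch of Method 2, "transfer models, not metrics": import peers' models into the local ensemble, keeping whichever already score well on local data. All interfaces and the Brier-score threshold below are illustrative assumptions:

```python
# Sketch of model exchange: merge models trained on peer instances into
# the local ensemble, keeping only those that pass a Brier-score bar on
# local data. Interfaces and threshold are assumptions for illustration.
import numpy as np

def local_brier(model, X_local, y_local):
    p = model.predict_proba(X_local)[:, 1]
    return np.mean((p - y_local) ** 2)

def import_peer_models(local_ensemble, peer_models, X_local, y_local,
                       max_brier=0.25):
    for model in peer_models:
        # A peer's model helps immediately if it already explains local
        # behavior - no retraining needed when a known problem recurs.
        if local_brier(model, X_local, y_local) <= max_brier:
            local_ensemble.append(model)
    return local_ensemble
```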

Page 45: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Additional issues

• How do/can we do transfer learning on similar but not identical instances?
• More efficient methods are needed for detecting which data from related components is required during diagnosis

Page 46: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Providing diagnosis as a web service: SLIC's IT-Rover

[Architecture diagram: monitored services feed metrics/SLO data into monitoring; a signature construction engine populates a signature DB that serves the clustering and retrieval engines, with an admin interface on top.]

A centralized diagnosis web service allows:
• Retrieval across different data centers/different services/possibly different companies
• Fast deployment of new algorithms
• Better understanding of real problems for further development of algorithms
• The value of the portal is in the information ("Google" for systems)

Page 47: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Discussion: Additional issues, opportunities, and challenges

• Beyond the "black box": using domain knowledge
− Expert knowledge
− Topology information
− Use known dependencies and causal relationships between components
• Provide solutions in cases where SLOs are not known
− Learn the relationship between business objectives and IT performance
− Anomaly detection methods with feedback mechanisms
• Beyond diagnosis: automated control and decision making
− HP-Labs work on applying adaptive controllers for controlling systems/applications
− IBM Research work using reinforcement learning for resource allocation

Page 48: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Summary

• Presented several challenges at the intersection of machine learning and automated IT diagnosis
• A relatively new area for machine learning and data mining researchers and practitioners
• Many more opportunities and challenges ahead, research- and product/business-wise…

Read more: www.hpl.hp.com/research/slic
− SOSP-05, DSN-05, HotOS-05, KDD-05, OSDI-04

Page 49: Machine Learning for Automated Diagnosis of Distributed Systems Performance

Publications:

• Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox, "Capturing, Indexing, Clustering, and Retrieving System History", SOSP 2005.
• Rob Powers, Ira Cohen, and Moises Goldszmidt, "Short term performance forecasting in enterprise systems", KDD 2005.
• Moises Goldszmidt, Ira Cohen, Armando Fox, and Steve Zhang, "Three research challenges at the intersection of machine learning, statistical induction, and systems", HotOS 2005.
• Steve Zhang, Ira Cohen, Moises Goldszmidt, Julie Symons, Armando Fox, "Ensembles of models for automated diagnosis of system performance problems", DSN 2005.
• Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, Jeff Chase, "Correlating instrumentation data to system states: A building block for automated diagnosis and control", OSDI 2004.
• George Forman and Ira Cohen, "Beware the null hypothesis", ECML/PKDD 2005.
• Ira Cohen and Moises Goldszmidt, "Properties and Benefits of Calibrated Classifiers", ECML/PKDD 2004.
• George Forman and Ira Cohen, "Learning from Little: Comparison of Classifiers given Little Training", ECML/PKDD 2004.