Automated Problem Diagnosis for Production Systems
Soila P. Kavulya
Scott Daniels (AT&T), Kaustubh Joshi (AT&T), Matti Hiltunen (AT&T), Rajeev Gandhi (CMU), Priya Narasimhan (CMU)
PARALLEL DATA LABORATORY
Carnegie Mellon University
Automated Problem Diagnosis
• Diagnosing problems
  • Creates major headaches for administrators
  • Worsens as scale and system complexity grow
• Goal: automate it and get proactive
  • Failure detection and prediction
  • Problem determination (or “fingerpointing”)
  • Problem visualization
• How: Instrumentation plus statistical analysis
November 12 http://www.pdl.cmu.edu/
Target Systems for Validation
• VoIP system at large telecom provider
  • 10s of millions of calls per day, diverse workloads
  • 100s of heterogeneous network elements
  • Labeled traces available
• Hadoop: MapReduce implementation
  • Hadoop clusters with homogeneous hardware
  • Yahoo! M45 & Opencloud production clusters
  • Controlled experiments in Amazon EC2 cluster
  • Long running jobs (> 100s): Hard to label failures
Assumptions of Approach
• Majority of system is working correctly
• Problems manifest in observable behavioral changes
  • Exceptions or performance degradations
• All instrumentation is locally timestamped
• Clocks are synchronized to enable system-wide correlation of data
• Instrumentation faithfully captures system behavior
Overview of Diagnostic Approach
[Figure: pipeline diagram with boxes for Application Logs, Performance Counters, End-to-end Trace Construction, Anomaly Detection, and Localization, producing a ranked list of root-causes]
Anomaly Detection Overview
• Some systems have rules for anomaly detection, e.g.,
  • Redialing number immediately after disconnection
  • Server reported error codes and exceptions
• If no rules available, rely on peer-comparison
  • Identifies peers (nodes, flows) in distributed systems
  • Detect anomalies by identifying “odd-man-out”
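A rule such as "redialing number immediately after disconnection" could be sketched roughly as below. The record fields (`id`, `caller`, `callee`, `start`, `end`) and the 30-second window are assumptions made for illustration; the slides do not specify the record format or what counts as "immediately".

```python
from datetime import datetime, timedelta


def flag_redialed_calls(calls, window=timedelta(seconds=30)):
    """Return ids of calls that were followed by a quick redial of the same number.

    `calls` is a list of dicts with hypothetical keys:
    'id', 'caller', 'callee', 'start', 'end' (datetimes).
    The window length is an assumed value, not one taken from the slides.
    """
    last_by_pair = {}
    suspects = []
    for call in sorted(calls, key=lambda c: c["start"]):
        pair = (call["caller"], call["callee"])
        prev = last_by_pair.get(pair)
        # A redial shortly after the previous call ended suggests that call failed.
        if prev is not None and timedelta(0) <= call["start"] - prev["end"] <= window:
            suspects.append(prev["id"])
        last_by_pair[pair] = call
    return suspects


def at(hms):
    """Helper for the example: build a datetime on an arbitrary fixed day."""
    return datetime.strptime("2012-11-12 " + hms, "%Y-%m-%d %H:%M:%S")


example_calls = [
    {"id": "c1", "caller": "alice", "callee": "bob", "start": at("09:31:00"), "end": at("09:31:05")},
    {"id": "c2", "caller": "alice", "callee": "bob", "start": at("09:31:15"), "end": at("09:40:00")},
    {"id": "c3", "caller": "carol", "callee": "dave", "start": at("09:32:00"), "end": at("09:35:00")},
]
suspected_failures = flag_redialed_calls(example_calls)
```

Here `c1` is flagged because the same caller redials the same number ten seconds after it ends. A rule like this only produces labels; it does not say which component caused the failure.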
Anomaly Detection Approach
• Histogram comparison identifies anomalous nodes
  • Pairwise comparison of node histograms
  • Detect anomaly if difference between histograms exceeds pre-specified threshold
[Figure: histograms (distributions) of durations of flows for one faulty node and two normal nodes; y-axis: normalized counts (total 1.0)]
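The histogram comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation: the distance measure (total variation distance) and the majority-vote rule are assumptions, since the slides only say that a node is anomalous when the difference between histograms exceeds a pre-specified threshold.

```python
import bisect


def normalized_histogram(durations, bin_edges):
    """Bucket durations into shared bins and normalize so the counts total 1.0."""
    counts = [0] * (len(bin_edges) - 1)
    for d in durations:
        i = bisect.bisect_right(bin_edges, d) - 1
        i = max(0, min(i, len(counts) - 1))  # clamp out-of-range values to the end bins
        counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def histogram_distance(h1, h2):
    """Total variation distance: 0.0 for identical histograms, 1.0 for disjoint ones.

    Assumed metric; the slides do not name the one actually used.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def find_anomalous_nodes(durations_by_node, bin_edges, threshold):
    """Flag nodes whose histogram differs from more than half of their peers."""
    hists = {n: normalized_histogram(d, bin_edges) for n, d in durations_by_node.items()}
    anomalous = []
    for node, hist in hists.items():
        dists = [histogram_distance(hist, other)
                 for peer, other in hists.items() if peer != node]
        # Majority vote relies on the assumption that most of the system works correctly.
        if dists and sum(d > threshold for d in dists) > len(dists) / 2:
            anomalous.append(node)
    return anomalous


flow_durations = {
    "node_a": [1, 2, 2, 3, 3, 3, 4],
    "node_b": [1, 2, 3, 3, 3, 4, 4],
    "node_c": [8, 9, 9, 10, 10, 10, 9],  # odd-man-out: flows take much longer
}
odd_nodes = find_anomalous_nodes(flow_durations, bin_edges=[0, 2, 4, 6, 8, 10], threshold=0.3)
```

The two normal nodes differ from each other only slightly, so each disagrees with just one of its two peers and is not flagged; `node_c` disagrees with both.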
Localization Overview
1. Obtain labeled end-to-end traces (labels indicate failures and successes)
  • Telecom systems
    – Use heuristics, e.g., redialing number immediately after disconnection
  • Hadoop
    – Use peer-comparison for anomaly detection since heuristics for detection are unavailable
2. Localize source of problems
  • Score attributes based on how well they distinguish failed calls from successful ones
“Truth Table” Call Representation
Log Snippet
Call1: 09:31am,SUCCESS,Server1,Server2,Phone1
Call2: 09:32am,FAIL,Server1,Customer1,Phone1

        Server1  Server2  Customer1  Phone1  Outcome
Call1   1        1        0          1       SUCCESS
Call2   1        0        1          1       FAIL

10s of thousands of attributes
10s of millions of calls
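The conversion from log lines to truth-table rows could look roughly like this. The parser assumes the comma-separated layout of the two-line snippet above (call id, time, outcome, then attributes); real logs will differ, and the function names are hypothetical.

```python
def parse_call(line):
    """Split 'CallN: time,OUTCOME,attr,attr,...' into (call_id, outcome, attributes)."""
    call_id, rest = line.split(":", 1)  # split once; the timestamp also contains ':'
    fields = [f.strip() for f in rest.split(",")]
    outcome, attributes = fields[1], fields[2:]
    return call_id.strip(), outcome, attributes


def build_truth_table(lines):
    """Return (columns, rows); each row is (call_id, 0/1 flags per column, outcome)."""
    parsed = [parse_call(line) for line in lines]
    columns = []
    for _, _, attributes in parsed:
        for attr in attributes:
            if attr not in columns:
                columns.append(attr)  # columns appear in first-seen order
    rows = [
        (call_id, [1 if col in attributes else 0 for col in columns], outcome)
        for call_id, outcome, attributes in parsed
    ]
    return columns, rows


log_snippet = [
    "Call1: 09:31am,SUCCESS, Server1,Server2,Phone1",
    "Call2: 09:32am,FAIL,Server1,Customer1,Phone1",
]
columns, rows = build_truth_table(log_snippet)
```

Columns come out in first-seen order, so `Customer1` lands after `Phone1` rather than in the order drawn on the slide. With tens of thousands of attributes, a dense 0/1 row per call would be wasteful; storing each call as a set of present attributes is a likely alternative.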
Identify Suspect Attributes
• Estimate conditional probability distributions
  • Prob(Success|Attribute) vs Prob(Failure|Attribute)
  • Update belief on distribution with each call seen
[Figure: belief curves for Success|Customer1 and Failure|Customer1; x-axis: Probability, y-axis: Degree of Belief]
Anomaly score: Distance between distributions
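One way to read this slide is sketched below, under several assumptions that the slides do not confirm: the belief is modeled as a Beta distribution starting from a uniform prior, one belief tracks how often the attribute appears in failed calls and another how often it appears in successful calls, and the "distance" is the Hellinger distance (chosen here because it has a closed form for Beta distributions). The slide's notation could also be read as conditioning on the attribute instead.

```python
import math
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief over a probability; Beta(1, 1) is the uniform prior."""
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, present):
        """Update the belief with one observed call."""
        if present:
            self.alpha += 1
        else:
            self.beta += 1


def log_beta_fn(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def hellinger(p, q):
    """Hellinger distance between two Beta beliefs: 0 (same) to 1 (no overlap)."""
    log_bc = (log_beta_fn((p.alpha + q.alpha) / 2, (p.beta + q.beta) / 2)
              - 0.5 * (log_beta_fn(p.alpha, p.beta) + log_beta_fn(q.alpha, q.beta)))
    bc = min(1.0, math.exp(log_bc))  # guard against rounding slightly above 1
    return math.sqrt(1.0 - bc)


def anomaly_score(calls, attribute):
    """calls: iterable of (set_of_attributes, failed_flag) pairs."""
    in_failed, in_success = BetaBelief(), BetaBelief()
    for attrs, failed in calls:
        (in_failed if failed else in_success).update(attribute in attrs)
    return hellinger(in_failed, in_success)


example_calls = (
    [({"Server1", "Customer1", "Phone1"}, True)] * 20
    + [({"Server1", "Customer1", "Phone1"}, False)] * 2
    + [({"Server1", "Server2", "Phone1"}, False)] * 78
)
customer_score = anomaly_score(example_calls, "Customer1")
phone_score = anomaly_score(example_calls, "Phone1")
```

In this made-up data `Customer1` appears in every failed call but in few successful ones, so its two beliefs barely overlap and it scores close to 1. `Phone1` appears in every call and scores lower; its score is not zero because the two beliefs rest on different numbers of calls and so differ in sharpness.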
Find Multiple Ongoing Problems
• Search for combination of attributes that maximizes anomaly score
  • E.g., (Customer1 and ServerOS4)
  • Greedy search limits combinations explored
  • Iterative search identifies multiple problems
UI: Ranked list of chronics
1. Chronic signature1: Customer1, ServerOS4
2. Chronic signature2: PhoneType7
[Figure: failed calls vs. time of day (GMT)]
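The greedy and iterative search might be sketched as follows. The scoring function is a simple stand-in (share of failed calls matching a signature minus share of successful calls matching it), not the anomaly score used by the authors, and the stopping parameters `max_size`, `min_score`, and `max_problems` are assumptions added for illustration.

```python
def score(calls, signature):
    """Stand-in score: fraction of failed calls matching minus fraction of successes."""
    failed = [attrs for attrs, is_failed in calls if is_failed]
    ok = [attrs for attrs, is_failed in calls if not is_failed]
    if not failed or not ok:
        return 0.0
    in_failed = sum(signature <= attrs for attrs in failed) / len(failed)
    in_ok = sum(signature <= attrs for attrs in ok) / len(ok)
    return in_failed - in_ok


def greedy_signature(calls, max_size=3):
    """Grow a signature one attribute at a time while the score keeps improving."""
    candidates = set().union(*(attrs for attrs, _ in calls))
    signature, best = frozenset(), 0.0
    while len(signature) < max_size:
        scored = [(score(calls, signature | {a}), a) for a in sorted(candidates - signature)]
        if not scored:
            break
        top_score, top_attr = max(scored)
        if top_score <= best:
            break  # no single addition helps, so stop instead of trying all combinations
        signature, best = signature | {top_attr}, top_score
    return signature, best


def find_problems(calls, min_score=0.2, max_problems=5):
    """Find a signature, drop the calls it explains, then search again."""
    remaining, problems = list(calls), []
    while len(problems) < max_problems:
        signature, s = greedy_signature(remaining)
        if not signature or s < min_score:
            break
        problems.append((signature, s))
        remaining = [(attrs, f) for attrs, f in remaining if not signature <= attrs]
    return problems


example_calls = (
    [({"Customer1", "ServerOS4"}, True)] * 10
    + [({"Customer1", "ServerOS2"}, False)] * 10
    + [({"Customer2", "ServerOS4"}, False)] * 10
    + [({"Customer3", "PhoneType7", "ServerOS2"}, True)] * 5
    + [({"Customer3", "PhoneType2", "ServerOS2"}, False)] * 20
)
problems = find_problems(example_calls)
```

On this made-up data neither `Customer1` nor `ServerOS4` alone separates failures cleanly, but the pair does, so the first signature found is the combination. Removing the calls it explains lets the second pass surface `PhoneType7`.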
Evaluation
• Prototype in use by Ops team
  • Daily reports over past 2 years
  • Helped Ops to quickly discover new chronics
• For example, to analyze 25 million VoIP calls
  • Two 2.4GHz Xeon cores, used <1 GB of memory
  • Data loading: 1.75 minutes for 6GB of data
  • Diagnosis: ~4 seconds per signature (near-interactive)
Call Quality (QoS) Violations
Message loss used as the event failure indicator (>1%)
Draco showed most QoS issues were tied to specific customers and not ISP network elements (as was previously believed)
1. ChronicSignature1: Service_A, Customer_A
2. ChronicSignature2: Service_A, Customer_N, IP_Address_N
[Figure: two plots of failed calls vs. time of day (GMT); annotations: "Customer name, IP" and "Incident at ISP:"]
In Summary…
• Use peer-comparison for anomaly detection
• Localize source of problems using statistics
  • Applicable when end-to-end traces available
  • E.g., customer, network element, version conflicts
• Approach used on Trone might vary
  • Depends on instrumentation available
  • Also depends on fault-model