Finger Pointing Mahendra Kutare [email protected] twitter - @imaxxs

Upload: boundary

Post on 05-Dec-2014

DESCRIPTION

As Boundary changes the game with second-by-second application monitoring, this will sometimes affect how you apply your problem analysis steps. Perhaps things can change.

TRANSCRIPT

Page 1: Finger pointing

Finger Pointing

Mahendra Kutare

[email protected] - @imaxxs

Page 2: Finger pointing

FingerPointing is a way through which humans communicate emotions of urgency, surprise, joy, acknowledgment, achievement, blame, frustration, fear and more.

FingerPointing ?

Page 3: Finger pointing

FingerPointing ?

Some do it with one.. Some need two..

Page 5: Finger pointing

Systems FingerPointing ?

Some do it everywhere...

Page 6: Finger pointing

Human Computer FingerPointing ?

Some do it with....

Page 7: Finger pointing

Systems Control Loop

Monitor

Recover

Collect

Analysis

Time to Collect

Time to Recover

Time to Detect/Analyze

Local Global

Info

Act

Page 8: Finger pointing

Systems Control Loop

Meter

Recover

Collector

Engine

Time to Collect

Time to Recover

Time to Detect/Analyze

Local Global

Page 9: Finger pointing

Problem Determination

Detection - Identifies violations or anomalies.

Diagnosis - Analyzes violations or anomalies.

Remediation - Recovers the system to normal state

Page 10: Finger pointing

Detection

Threshold

Signature

Anomaly

Page 11: Finger pointing

Detection

Thresholds - Matching single value/predicate.

Signature - Matching faults with known fault signatures. It can detect a set of known faults.

Anomalies - Learn to recognize the normal runtime behavior. It can detect previously unseen faults.

Page 12: Finger pointing

Aniketos

No use of statistical machine learning.

Uses computational geometry - convex hull.

Convex hull - Encompassing shape around a group of points.

Works independent of whether metrics are correlated or not.

Stehle, Lynch et al., ICAC 2010
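The convex hull idea can be sketched in a few lines of Python. This is only an illustration of the geometry, not the Aniketos implementation: the training points, the function names and the choice of Andrew's monotone chain algorithm are all assumptions made for the example, and it handles two metrics only.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make the lower chain turn clockwise or go straight.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def is_normal(hull, point):
    """True if the point lies inside or on the hull (hull needs >= 3 vertices)."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


# Hypothetical training data: (metric A, metric B) pairs seen during normal runs.
training = [(5, 10), (6, 12), (7, 15), (8, 16), (9, 18), (10, 20), (6, 13), (9, 17)]
hull = convex_hull(training)

print(is_normal(hull, (8, 16)))  # a point surrounded by training data
print(is_normal(hull, (9, 11)))  # within each metric's range, but outside the hull
```

The second query is the interesting one: each metric on its own is within the range seen in training, yet the combination was never observed, so the hull rejects it.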

Page 13: Finger pointing

Fault Detection

Page 14: Finger pointing

Training Phase

No one knows when enough training data is collected.

If a system has an extensive test suite, that represents normal behavior, then execution of the test suite will produce a good training dataset.

Replay request logs of production system on test system.

Page 15: Finger pointing

Bounded Box Example

Given two metrics A and B, if the safe range of A is 5 to 10 and B is 10 to 20, the normal behavior of the system can be represented as a 2D rectangle with vertices (5,10), (5,20), (10,20) and (10,10).

Any datapoint that falls within that rectangle, for example (7,15), is classified as normal.

Any datapoint that falls outside of the rectangle, for example (15,15), is classified as anomalous.
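The bounded box check above amounts to one range test per metric. A minimal sketch, using the safe ranges from the example; the names `SAFE_RANGES` and `classify` are made up for illustration.

```python
# Safe ranges from the example above: A in [5, 10], B in [10, 20].
SAFE_RANGES = {"A": (5, 10), "B": (10, 20)}


def classify(sample, safe_ranges=SAFE_RANGES):
    """Return 'normal' if every metric is within its safe range, else 'anomalous'."""
    inside = all(lo <= sample[name] <= hi for name, (lo, hi) in safe_ranges.items())
    return "normal" if inside else "anomalous"


print(classify({"A": 7, "B": 15}))   # normal
print(classify({"A": 15, "B": 15}))  # anomalous
```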

Page 16: Finger pointing

Detection Phase

Page 17: Finger pointing

Egress/Ingress Data

volume_1s_meter_ip query, 6000 data points

Page 18: Finger pointing

Egress/Ingress Data

volume_1s_meter_ip query, 150,000 data points

Page 19: Finger pointing

Fault Detection Comparison

Maximum fault coverage, traded off against false positives

Page 20: Finger pointing

Diagnosis

Dependency Inference

Correlation Analysis

Peer Analysis

Page 21: Finger pointing

E2EProf

Sandeep et al., DSN 2007

Useful for debugging distributed systems of black boxes.

Page 22: Finger pointing

Service Paths

Client requests take different “paths” through the software, invoking dynamic dependencies across distributed systems. The ensemble of paths taken by client requests forms the “Service Paths”.

Key idea - Convert message traces per service node to per edge signals and compute cross correlations of these signals.

Page 23: Finger pointing

Path Discovery

A request path VC1->VS1->VS2->VS4

Collect timestamps and source/destination IPs at each VS node.

Calculate the cross correlation between time series signals across VS nodes.

If the cross correlation has a spike at a phase lag equal to the latency between nodes, there exists a path/edge between the VS nodes.
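The path discovery steps above can be sketched as follows. This is a toy version, not the E2EProf algorithm itself: the bin width, the 0.8 spike threshold, the function names and the timestamps are all assumptions chosen for illustration.

```python
def to_signal(timestamps, n_bins, bin_width=1.0):
    """Turn message timestamps into a per-bin message-count signal."""
    signal = [0] * n_bins
    for t in timestamps:
        idx = int(t // bin_width)
        if 0 <= idx < n_bins:
            signal[idx] += 1
    return signal


def cross_correlation(x, y, max_lag):
    """Normalized cross correlation of x against y for lags 0..max_lag."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    norm = (sum(v * v for v in dx) * sum(v * v for v in dy)) ** 0.5
    if norm == 0:
        return [0.0] * (max_lag + 1)  # a flat signal correlates with nothing
    return [sum(dx[i] * dy[i + lag] for i in range(n - lag)) / norm
            for lag in range(max_lag + 1)]


def find_edge(x, y, max_lag, threshold=0.8):
    """Return (lag, correlation) if there is a spike above threshold, else None."""
    corr = cross_correlation(x, y, max_lag)
    lag = max(range(len(corr)), key=corr.__getitem__)
    return (lag, corr[lag]) if corr[lag] >= threshold else None


# Hypothetical trace: VS2 sees each of VS1's messages 3 time units later.
vs1_times = [1, 4, 5, 9, 14, 15, 20, 26]
vs2_times = [t + 3 for t in vs1_times]
vs1 = to_signal(vs1_times, n_bins=40)
vs2 = to_signal(vs2_times, n_bins=40)

edge = find_edge(vs1, vs2, max_lag=10)
print(edge)  # (lag, correlation) for the inferred VS1 -> VS2 edge
```

The lag at which the spike appears is the estimated latency of the edge. Real traces are noisier than this one, so the threshold would need tuning.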

Page 24: Finger pointing

App Vis

Network topology view

Augment with “service paths”?

Page 25: Finger pointing

Remediation

Software Rejuvenation for Software Aging

Reactive - Reboots, Micro Reboots

Proactive - Time or load based

Checkpointing and Recovery

Treating bugs as allergies

Page 26: Finger pointing

Software Aging

Patriot missiles, used during the Gulf War to destroy Iraq’s Scud missiles, used a computer whose software accumulated errors, i.e. software aging.

The effect of aging in this case was the misinterpretation of an incoming Scud as a false alarm rather than a missile, which resulted in the death of 28 US soldiers.

Page 27: Finger pointing

Software Rejuvenation

Periodic preemptive rollback of continuously running applications to prevent failures in the future.

Open loop - Not based on feedback from the system. Triggered by elapsed time or cumulative jobs in the system.

Closed loop - Based on some notion of system health. Continuously monitor and analyze the estimated time to exhaustion of a resource.

Trivedi et al., Duke University.
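A closed-loop policy of the kind described above can be sketched by fitting a trend to resource usage and projecting when it hits capacity. The linear least-squares fit, the function names and the sample numbers are assumptions for illustration; the cited work may use different estimators.

```python
def time_to_exhaustion(samples, capacity):
    """Estimate time from the last sample until usage reaches capacity.

    samples: list of (time, usage) pairs. Uses a least-squares line through
    the samples. Returns None if usage is not growing.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples taken at the same time, no trend
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # usage flat or shrinking, no exhaustion projected
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_rejuvenate(samples, capacity, horizon):
    """Closed loop: rejuvenate when projected exhaustion is within the horizon."""
    eta = time_to_exhaustion(samples, capacity)
    return eta is not None and eta <= horizon


# Hypothetical memory samples: (seconds, MB used), leaking 0.5 MB per second.
memory = [(0, 100), (60, 130), (120, 160), (180, 190)]

print(time_to_exhaustion(memory, capacity=400))              # seconds left
print(should_rejuvenate(memory, capacity=400, horizon=600))  # True
print(should_rejuvenate(memory, capacity=400, horizon=300))  # False
```

An open-loop policy would skip the estimate entirely and restart on a fixed timer or job count.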

Page 28: Finger pointing

Apache Web Server

MaxRequestsPerChild - If this is set to a positive value, then the parent process of Apache kills a child process as soon as that child has handled MaxRequestsPerChild requests.

By doing this, Apache limits “the amount of memory a process can consume by accidental memory leak” and “helps reduce the number of processes when server load reduces.”

Page 29: Finger pointing

Treating Bugs as Allergies

Inspired by allergy treatment in real life. If you are allergic to milk, remove dairy products from your diet.

Rollback the program to a recent checkpoint when a bug is detected, dynamically change the execution environment based on failure symptoms, and then re-execute the program in modified environment.

Qin et al., SOSP 2005
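The rollback, change, re-execute cycle can be sketched as a toy in Python. This is not how Rx is implemented: here an exception stands in for bug detection, a deep copy stands in for the checkpoint, and the `zero_fill` flag, `run_with_rx` and `handle_request` are hypothetical names invented for the example.

```python
import copy


def run_with_rx(state, operation, environments):
    """Try `operation` under each environment in turn, rolling back on failure."""
    checkpoint = copy.deepcopy(state)  # checkpoint taken before execution
    for env in environments:
        try:
            return operation(state, env), env
        except Exception:
            # Failure detected: restore the checkpoint, then retry
            # in the next (modified) environment.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("no environment change avoided the failure")


def handle_request(state, env):
    """Toy request handler with an 'uninitialized read' bug."""
    state["in_progress"] = True  # partial update a crash would leave behind
    # None models an uninitialized buffer; zero filling is the environment change.
    buf = [0] * 4 if env["zero_fill"] else [None] * 4
    total = sum(buf)  # raises TypeError on the uninitialized buffer
    state["handled"] += 1
    state["in_progress"] = False
    return total


state = {"handled": 0, "in_progress": False}
environments = [{"zero_fill": False}, {"zero_fill": True}]

result, env_used = run_with_rx(state, handle_request, environments)
print(result, env_used, state)
```

The first attempt fails and its partial state change is rolled back. The second attempt runs in the modified environment and completes.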

Page 30: Finger pointing

Treating Bugs As Allergies

Page 31: Finger pointing

Examples

Uninitialized reads may be avoided if every newly allocated buffer is filled with zeros.

Data races can be avoided by changing timing-related events such as thread scheduling and asynchronous events.

Page 32: Finger pointing

Environment Changes

Page 33: Finger pointing

Comparison of Rx and Alternative Approaches

For systems where reboot (~5 sec) is not good enough

Checkpoint, Replay bounded by reboot (~5 sec)
