System for Troubleshooting Big Data Applications in Large Scale Data Centers Chengwei Wang Advisor: Karsten Schwan CERCS Lab, Georgia Institute of Technology


System for Troubleshooting Big Data Applications in Large Scale Data Centers

Chengwei Wang
Advisor: Karsten Schwan

CERCS Lab, Georgia Institute of Technology


Collaborators

• Canturk Isci (IBM Research)
• Vanish Talwar, Krishna Viswanathan, Lakshminarayan Choudur, Parthasarathy Ranganathan, Greg MacDonald, Wade Satterfield (HP Labs)
• Mohamed Mansour (Amazon.com)
• Dani Ryan (Riot Games)
• Greg Eisenhauer, Matthew Wolf, Chad Huneycutt, Liting Hu (CERCS, Georgia Tech)


Large Scale Data Center Hardware

5 x 40 x 10 x 4 x 16 x 2 x 32 = 8’192’000 cores (8 million + VMs)

Amazon EC2 has an estimated 454,400 (~0.5 million) servers.

Routers, Switches, Network Topologies ….


Large Scale Data Center Software

Twitter Storm

WebAPP

BigData

StreamData


‘Big Data’ Application

(Figure: components of the application. Labels: Web Log sources; Agents; Collectors; Flume Master; HMaster; Namenodes; Data Nodes; Master; Slave/TaskTracker nodes; Data Blocks; output Page Views as (PageID, # views).)

Exposed as Services in Utility Cloud


Troubleshooting War On Christmas Eve

Amazon ELB state data accidentally deleted

12:24 PM

Netflix Streaming Outage

12:30 PM, 5:02 PM

Amazon engineers find the root cause

2:45 AM 12/25/2012

Recover ELB state data to the state before it was deleted

5:40 AM 12/25/2012

Data state merge process completed

8:15 AM 12/25/2012

War is over, well, forever?

Local Issue: API partially affected

A large number of ELB services need to be recovered

Global Issue: ELB requests high latency

Based on 2010 quarterly revenues, downtime could cost up to $1.75 million/hour

Not a perfect Christmas ……


Challenges for Troubleshooting

• Dynamism: dynamic interactions/dependencies
• Large Scale: thousands to millions of entities
• Overhead: profiling/tracing information required
• Time-Sensitive: responsive troubleshooting online

(Figure label: E2E Latency ? ? ?)


Research Components

• VScope: Middleware for Troubleshooting Big Data APPs [1]
• Modeling Monitoring/Analytics System Design [2]
• Statistical Anomaly Detection: EbAT, Tukey, Goodness-of-Fit [3, 4]
• Anomaly Ranking [5]
• Guidance

References:
1. VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications, Middleware’12.
2. A Flexible Architecture Integrating Monitoring and Analytics for Managing Large-Scale Data Centers, ICAC’11.
3. Statistical Techniques for Online Anomaly Detection in Data Centers, IM’11.
4. Online Detection of Utility Cloud Anomalies Using Metric Distribution, NOMS’10.
5. Ranking Anomalies in Data Centers, NOMS’12.



What is VScope?

• From a systems perspective, VScope is a distributed system for monitoring and analyzing metrics in data centers.

• From a user’s perspective, VScope is a tool providing dynamic mechanisms and basic operations to facilitate troubleshooting.


Human Troubleshooting Activities

Interaction Analysis
• Which collector did the problematic agent talk to?
• Which region servers did the collector talk to?

Anomaly Detection
• Monitoring agent latency; alarm when latency is high
• Which agents had the abnormal latencies?

Profiling & Tracing
• RPC log in region servers
• Debug log in data nodes


VScope Operations

• Watch (Anomaly Detection): continuous anomaly detection
• Scope (Interaction Analysis): on-line interaction tracking
• Query (Profiling & Tracing): dynamic metric collection/analytics deployment


Distributed Processing Graph (DPG)

(Figure labels: VNode; Look-Back Window; Metrics; Local Analysis Results; Aggregate Monitoring Data; Global Results; Flexible Topology.)
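The labels on this slide suggest how a DPG processes data: each VNode keeps a look-back window of metrics, produces local analysis results, and results are aggregated into global results. The following is a minimal sketch of that idea, not VScope's implementation. The class name `VNode`, its methods, and the use of a plain mean as the "analysis" are assumptions made for illustration.

```python
from collections import deque
from statistics import mean


class VNode:
    """Hypothetical processing node in a DPG (names are illustrative only)."""

    def __init__(self, name, window_size=3):
        self.name = name
        self.children = []                        # VNodes that feed results to this node
        self.window = deque(maxlen=window_size)   # look-back window of recent metrics

    def add_child(self, child):
        self.children.append(child)
        return child

    def push_metric(self, value):
        """Record one raw metric sample; the oldest sample falls out of the window."""
        self.window.append(value)

    def local_result(self):
        """Local analysis over the look-back window (here: a plain mean)."""
        return mean(self.window) if self.window else None

    def global_result(self):
        """Aggregate this node's local result with the results of its children."""
        results = [child.global_result() for child in self.children]
        results.append(self.local_result())
        results = [r for r in results if r is not None]
        return mean(results) if results else None


root = VNode("root")
leaf_a = root.add_child(VNode("leaf-a"))
leaf_b = root.add_child(VNode("leaf-b"))
for v in (1.0, 2.0, 3.0, 4.0):    # a window of size 3 keeps 2.0, 3.0, 4.0
    leaf_a.push_metric(v)
leaf_b.push_metric(5.0)
dpg_global = root.global_result()
```

Because the topology is just the parent/child links, the same node code could be arranged as a centralized star, a tree, or a mix, which is the flexibility the slide points to.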


VScope System Architecture

(Figure labels: VShell; VMaster; metric library; function library; VScope/DPG Operations; DPGManager; DPGs (Initiate, Change, Terminate); VNode; Flume agent, Flume master, collector; Xen Hypervisor with Dom0 and DomU.)


VScope Software Stack

Troubleshooting Layer

Watch Scope Query

Guidance

DPG Layer

API&Cmds

VScope Runtime

Anomaly Detection & Interaction Tracking

DPGs


Usecase I: Culprit Region Servers

• Inter-Tier Issue: When E2E performance is slow, is it due to collector or region server issues?
• Scale: There could be thousands of region servers!
• Interference: Turning on debug-level Java logging causes high interference.

(Figure labels: Normal; E2E Perf. Low; Slow? Which?)


Horizontal Guidance (Across Tiers)

Iterative analysis, starting from the Flume agents:

1. Watch: entropy detection on E2E latency (SLA violation on latency) → abnormal Flume agents
2. Scope: using the connection graph → related collectors & region servers → shared RegionServers
3. Query: dynamically turn on debugging and analyze timing in RPC-level logs → processing time in RegionServers
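The Scope step narrows the search from abnormal agents to the collectors and region servers they depend on. A minimal sketch of that narrowing follows, assuming the connection graph is available as a plain adjacency mapping. The node names, the `scope` and `shared_targets` helpers, and the sample graph are hypothetical and are not part of VScope's API.

```python
from collections import Counter

# Hypothetical connection graph: which node sends data to which (illustrative names).
connections = {
    "agent-1": ["collector-1"],
    "agent-2": ["collector-1"],
    "agent-3": ["collector-2"],
    "collector-1": ["rs-1", "rs-2"],
    "collector-2": ["rs-2", "rs-3"],
}


def scope(graph, sources):
    """Return the set of nodes directly connected to any of the sources."""
    return {dst for src in sources for dst in graph.get(src, [])}


def shared_targets(graph, sources):
    """Return targets reached by more than one source (shared downstream nodes)."""
    counts = Counter(dst for src in sources for dst in set(graph.get(src, [])))
    return {dst for dst, n in counts.items() if n > 1}


abnormal_agents = {"agent-1", "agent-3"}    # assumed output of the Watch step
related_collectors = scope(connections, abnormal_agents)
related_region_servers = scope(connections, related_collectors)
suspects = shared_targets(connections, related_collectors)
```

In this toy graph only `rs-2` is shared by both related collectors, so debug-level logging would only need to be turned on there rather than on every region server.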


VScope vs Traditional Solutions

20 Region Servers, One Culprit Server

VScope greatly reduces interference with the application.


Usecase II: Naughty VM

(Figure labels: Slave/TaskTracker; Agent; Hypervisor; Good VM; Naughty VM; Over-consume Shared Resource (due to heavy HDFS I/O); Slow.)

Inter-Software-Level Issue: it is hard to find the root cause without knowing the VM-to-machine mapping.


Vertical Guidance (Across SW Levels)

(Figure: four traces over time on log-scale axes. Traces 1 to 4: E2E Performance, Good VM, Hypervisor, Naughty VM. Y-axes: Flume Latency (S) and #Mpkgs/second. Annotations: Anomaly Injected; HDFS Write; Remedy using Traffic Shaping in Dom0.)

Guidance steps shown on the traces:
• Watch E2E Latency
• Query Good VM
• Scope/Query Hypervisor
• Scope/Query Naughty VM
• HDFS I/O Remedy


VScope Performance Evaluation

• What are the monitoring overheads?
• How fast can VScope deploy a DPG?
• How fast can VScope track interactions?
• How well can VScope support analytics functions?


Evaluation Setup

• Deployed VScope on CERCS Cloud (using OpenStack) hosting 1200 Xen Virtual Machines (VMs). http://cloud.cercs.gatech.edu/
• Each VM has 2GB memory and at least 10GB disk space.
• Ubuntu Linux servers (1TB SATA disk, 48GB memory, and 16 CPUs (2.40GHz)).
• Cluster with 1 Gb Ethernet networks.


GTStream Benchmark

(Figure: components of the benchmark. Labels: Web Log sources; Agents; Collectors; Flume Master; HMaster; Namenodes; Data Nodes; Master; Slave/TaskTracker nodes; Data Blocks; output Page Views as (PageID, # views).)


VScope Runtime Overheads

VScope has low overheads.

DPGs are doing anomaly detection and interaction tracking


DPG Deployment

Fast DPG deployment at large scale with various topologies

Deploy balanced-tree DPG on VMs with different BFs (Branching Factor)

(Figure x-axis: # of VMs)
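To make the role of the branching factor concrete, the sketch below lays out a balanced tree over a given number of VMs and reports its depth. It is an illustration only: the function names and the level-order numbering of nodes are assumptions, and the slides do not describe how VScope actually assigns parents.

```python
def balanced_tree_parents(num_nodes, branching_factor):
    """Map each node index to its parent index in a balanced tree.

    Nodes are numbered in level order; node 0 is the root and has parent None.
    """
    if num_nodes < 1 or branching_factor < 1:
        raise ValueError("need at least one node and a branching factor >= 1")
    parents = {0: None}
    for node in range(1, num_nodes):
        parents[node] = (node - 1) // branching_factor
    return parents


def tree_depth(parents):
    """Largest number of hops from any node up to the root."""
    depth = 0
    for start in parents:
        hops = 0
        current = start
        while parents[current] is not None:
            current = parents[current]
            hops += 1
        depth = max(depth, hops)
    return depth


tree_bf3 = balanced_tree_parents(13, 3)   # 1 root, 3 children, 9 grandchildren
tree_bf2 = balanced_tree_parents(7, 2)    # complete binary tree with 7 nodes
```

A larger branching factor gives a shallower tree (fewer hops to the root) at the price of more children per node, which is the kind of trade-off the deployment experiment varies.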


Interaction Tracking

Fast interaction tracking at large scale

Tracking network connection relations between VMs

(Figure x-axis: # of VMs)


Analytics Support

Efficiently support a variety of analytics.

Measuring deployment and computation time with real analytics


VScope Features

Column groups: Debug-Level On-Line Troubleshooting | Info-Level On-Line Monitoring
Columns in each group: Low Storage, Low Network, Low Interference, Complete Coverage

Brute-Force (Ganglia, Nagios, Astrolabe, SDIMS): √ √ √ √ √
Sampling (GWP, Dapper, Fay, Chopstix): √ √ Uncontrollable Random | √ √ √ Random
VScope: √ √ Controllable Focused | √ √ √ Focused

VScope Advantages:
1. Controllable Interference
2. Guided/Focused Troubleshooting



Monitoring/Analysis System Design Choices

• Traditional Design: Centralized, Balanced Tree, Binomial Tree

• Novel System Design (Using DPG)
  > Hybrid: Federating Various Topologies
  > Dynamic: Topologies On-Demand


Modeling Monitoring/Analysis System Performance/Cost

• Is there a best design choice for all scales?
• How does scale affect system design?
• How do analytics features affect system design?
• How do data center configurations affect system design?
• Is there any tradeoff between performance and cost?


Data Center Parameters

*Example values are quoted from publications or gained from micro-benchmark experiments and experiences of HP production teams


Performance/Cost Metrics

• Performance: Time to Insight (TTI). The latency between the time when one or more monitoring metrics are collected and the time when the analysis of those metrics is done.

• Cost: Capital Cost for Management. Dollar amount spent on hardware/software for monitoring/analytics.


Analytical Formulations

(Table: Time To Insight (TTI) and Capital Cost formulated for each topology: Centralized, Hierarchical Tree, Binomial Forest, Hybrid Topologies.)


Compare Topologies at Scale

• No single topology is the best in all configurations
• High performance may incur high cost
• Hybrid design may be a good choice

(Figure panels: Analytics with O(N) complexity; Analytics with O(N^2) complexity; Capital Cost.)


Trade-off of Performance/Cost

(Figure: three plots. TTI (seconds) vs. Number of Nodes (×10^5) for d = 2, 16, 50, 100, 200; Capital Cost (million $) vs. Number of Nodes (×10^5) for the same values of d; TTI (seconds) vs. Number of Nodes (0 to 8000) for Centralized, HT-Collocated, BSF, and HT-Dedicated. Annotations: Lowest TTI; Highest Cost; Best.)

• Hierarchical Tree (fanout 2) has the best performance but the highest cost

• Centralized has the best performance and lowest cost when <2000 nodes, but the worst performance when >6000


Insights

• No static, ‘one size fits all’ topology
• Design may trade off performance/cost
• DPG can provide dynamic topology and analytics variety support at large scale
• Novel, hybrid topology can yield good performance/cost.
• The principles we follow in VScope.



Statistical Anomaly Detection

• Distribution-based anomaly detection
• Online
• Integrated into VScope
• Dynamically deployed by VScope


A Brief Summary

• Entropy-based Anomaly Tester (EbAT)
• Leveraging Tukey Method and Chi-Square Test
• Experiment on Real-World Data Center Traces


Conclusion

• VScope is a scalable, dynamic, lightweight middleware for troubleshooting real-time big data applications.

• We validate VScope in a large-scale cloud environment with a realistic multi-tier stream processing benchmark.

• We showcase VScope’s ability to troubleshoot horizontally across tiers and vertically across software levels in two real-world use cases.

• Through analytical modeling, we conclude that dynamism, flexibility, and a tradeoff between performance and cost are needed for large-scale monitoring/analytics system design.

• We propose statistical anomaly detection algorithms based on distribution change rather than change in individual measurements.


State of the Art: System Analytics

(Figure: existing system analytics tools placed along three axes: Scale (Single host, Cluster, Data Center, Cloud), Complexity/Online, and Dynamism (Static, Dynamic). Labels shown: Multi-Tier; sar; vmstat; top; ps; slick; Console mining; regression; Hyp. HQ; Ganglia; Chukwa; G.work; Osmius; Moara; PMP; Openview/Tivoli; magpie; pinpoint; sherlock; Chopstix; Fay; GWP; Dapper; CLUE; SIAT. The Ph.D. thesis research area is marked.)

Gap: a lack of systems and algorithms to support dynamic, online, complex diagnosis at large scale


Future Work

• System Analytics

• Large scale complexities, a variety of workloads, big data (system logs, application traces)

• Cloud Management (resource management, troubleshooting, migration planning, performance/cost analysis); Power Management; Performance optimization, etc.

• Investigating/Leveraging large scale, online, machine learning and data mining for system analytics


Thanks! Questions?


Backup Slides


VScope System Architecture

(Figure labels: VShell; VMaster; metric library; function library; VScope/DPG Operations; DPGManager; DPGs (Initiate, Change, Terminate); VNode; Flume agent, Flume master, collector; Xen Hypervisor with Dom0 and DomU; OpenTSDB; TSD (Time-Series Daemon); Historical Data; Query.)


Why Is Dynamism Important?

We cannot afford tracing everywhere!


Distribution-based vs Value-based

• Sporadic Spikes
• Pattern vs individual measurement


EbAT (Entropy-based Anomaly Tester)

Time Series Analysis

1. Exponential Weighted Moving Average (EWMA)

Signal Processing

1. Wavelet Analysis

Threshold-based

1. Visual Identification

2. Three-Sigma Rule


Entropy Time Series Construction

1. Maintain look-back window
   • Example: look-back window of size 3

2. Perform data pre-processing
   • Normalization: divide values by mean of samples
   • Data binning: hash values into one of m+1 bins
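The two pre-processing steps can be sketched as follows. This is an illustration under stated assumptions, not the EbAT code: the helper names are made up, and the parameter `r` (the value range covered by the first m bins, with one extra overflow bin) is an assumption, since the slide only says values are hashed into bins.

```python
from statistics import mean


def normalize(samples):
    """Divide every value by the mean of the samples (mean assumed non-zero)."""
    mu = mean(samples)
    return [x / mu for x in samples]


def to_bin(value, m, r):
    """Map a normalized value to a bin index in 0..m.

    Bins 0..m-1 split [0, r) evenly; bin m collects every value >= r.
    The range parameter r is an assumption made for this sketch.
    """
    if value >= r:
        return m
    return max(0, int(value * m / r))


window = [10.0, 20.0, 30.0]        # look-back window of size 3
normalized = normalize(window)     # each value divided by the window mean (20.0)
bins = [to_bin(v, m=4, r=2.0) for v in normalized]
```

Normalizing first means the same bin boundaries can be reused for metrics with very different magnitudes.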


Entropy Time Series Construction

3. M-Event creation for look-back window
   • Monitoring Event (M-Event) @ sample s: <e_s1, e_s2, e_s3, …, e_sn>

4. Entropy calculation
   • Determine the count n_i of each event e_i in the n samples
   • Given v unique events e_i in the n samples, entropy is calculated as $H = -\sum_{i=1}^{v} \frac{n_i}{n} \log \frac{n_i}{n}$
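The entropy step can be sketched in a few lines, assuming the standard Shannon entropy over the event counts in one look-back window. The function name is hypothetical and the choice of log base 2 is an assumption; the slide does not state the base.

```python
import math
from collections import Counter


def window_entropy(events):
    """Shannon entropy of the events observed in one look-back window.

    events: a sequence of hashable M-events, for example tuples of bin indices.
    """
    n = len(events)
    if n == 0:
        return 0.0
    counts = Counter(events)    # n_i for each unique event e_i
    # (n_i / n) * log2(n / n_i) equals -(n_i / n) * log2(n_i / n)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


steady = window_entropy([(1, 2), (1, 2), (1, 2)])          # one repeated event
mixed = window_entropy([(1, 2), (1, 2), (3, 0), (3, 0)])   # two equally likely events
```

A window where every sample produces the same event has zero entropy, and entropy grows as the events in the window become more varied, which is what makes a sudden change in the entropy series informative.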


Local and Global Entropies

• An entropy time series is created at every level of the cloud hierarchy
• Local entropy: leaf-level entropy time series (at every VM)
   • uses raw monitoring data as input
• Global entropy: non-leaf-level entropy time series (aggregated entropy)
   • uses child entropy time series as input data
   • can calculate entropy of child entropies or aggregate it in other ways
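Two ways of aggregating child entropies are named in the EbAT results slide: the sum of child entropies plus their product (E1+E2+E3+E1*E2*E3), and the entropy of the child entropies. The sketch below illustrates both. The function names are hypothetical, and the `bin_width` used to discretize child entropies in the second method is an assumption, since the slides do not say how that is done.

```python
import math
from collections import Counter


def aggregate_sum_product(child_entropies):
    """Method I style: sum of the child entropies plus their product."""
    return sum(child_entropies) + math.prod(child_entropies)


def aggregate_entropy_of_entropies(child_entropies, bin_width=0.5):
    """Method II style: entropy of the binned child entropy values.

    bin_width is an assumption made for this sketch.
    """
    n = len(child_entropies)
    if n == 0:
        return 0.0
    counts = Counter(int(e / bin_width) for e in child_entropies)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


method_one = aggregate_sum_product([1.0, 2.0, 3.0])
method_two = aggregate_entropy_of_entropies([0.1, 0.2, 1.1, 1.2])
```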


Entropy Time Series Processing

• Entropy calculation done for every look-back window results in an entropy time series
• Sharp changes in the entropy time series are tagged as anomalies (or the 3-sigma rule is used if a normal distribution is assumed)
• Visual analysis or signal processing can be used

(Figure: examples.)


Previous Threshold Definition

• Gaussian/normal distribution assumed for data
• 68-95-99.7 rule
• Fixed thresholds: μ ± 3σ

(Figure: Gaussian distribution over -4 to 4 standard deviations, with the lower 3σ limit and upper 3σ limit marked.)
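For reference, a fixed 3σ threshold check looks like the sketch below. The function names and the sample history are made up for illustration, and the population standard deviation is used as an assumption.

```python
from statistics import mean, pstdev


def three_sigma_limits(history):
    """Return (lower, upper) limits at mean ± 3 standard deviations of the history."""
    mu = mean(history)
    sigma = pstdev(history)
    return mu - 3 * sigma, mu + 3 * sigma


def is_anomalous(value, limits):
    """True when the value falls outside the (lower, upper) limits."""
    lower, upper = limits
    return value < lower or value > upper


history = [10, 12, 11, 9, 8, 10, 11, 9]    # made-up metric samples
limits = three_sigma_limits(history)
```

The limits are only meaningful if the data is roughly Gaussian, which is the assumption the following slides remove.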


Remove Distribution Assumptions

• Tukey Method
   - No distribution assumption
   - For individual values
• Goodness-Of-Fit Method
   - No distribution assumption
   - Test if the current distribution complies with the normal distribution derived from history


Tukey Method

Lower Threshold: $Q_1 - k\,|Q_3 - Q_1|$

Upper Threshold: $Q_3 + k\,|Q_3 - Q_1|$

Limits for serious outliers (k = 3):

$ltl = Q_1 - 3\,|Q_3 - Q_1|$

$utl = Q_3 + 3\,|Q_3 - Q_1|$

Observations falling beyond these limits are called serious outliers.

Possible outliers:

$Q_3 + 1.5\,|Q_3 - Q_1| \le x_i < Q_3 + 3.0\,|Q_3 - Q_1|$

$Q_1 - 3.0\,|Q_3 - Q_1| \le x_i < Q_1 - 1.5\,|Q_3 - Q_1|$
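A minimal sketch of the Tukey thresholds follows, using quartiles from Python's standard library. The function names are hypothetical, the quartile method ("inclusive") is an assumption, and values exactly on the outer upper limit are treated as serious outliers here, which the slide leaves unspecified.

```python
from statistics import quantiles


def tukey_limits(values, k):
    """Return (lower, upper) = (Q1 - k*|Q3 - Q1|, Q3 + k*|Q3 - Q1|)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = abs(q3 - q1)
    return q1 - k * spread, q3 + k * spread


def classify(x, values):
    """Label x as 'normal', 'possible outlier', or 'serious outlier'."""
    inner_low, inner_high = tukey_limits(values, 1.5)
    outer_low, outer_high = tukey_limits(values, 3.0)
    if x < outer_low or x >= outer_high:
        return "serious outlier"
    if x >= inner_high or x < inner_low:
        return "possible outlier"
    return "normal"


samples = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # made-up history; Q1 = 3, Q3 = 7
```

Because the limits come from quartiles of the observed data, no Gaussian assumption is needed.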


Goodness-of-Fit (GOF) Test

Look-back window

Empirical Distribution: P1

History Distribution: P

Chi-Square Goodness-of-Fit (P, P1)

Pass: normal
Fail: abnormal
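The pass/fail decision can be sketched with a hand-computed Pearson chi-square statistic. This is an illustration under assumptions: the function names, the four-bin history distribution, and the choice of a 5% significance level are made up for the example. Bins with zero expected probability are simply skipped in this sketch.

```python
def chi_square_statistic(observed_counts, expected_probs):
    """Pearson chi-square statistic of observed bin counts against expected probabilities."""
    total = sum(observed_counts)
    stat = 0.0
    for observed, p in zip(observed_counts, expected_probs):
        expected = total * p
        if expected > 0:    # zero-probability bins are skipped in this sketch
            stat += (observed - expected) ** 2 / expected
    return stat


def window_is_normal(observed_counts, expected_probs, critical_value):
    """Pass (True) when the statistic stays below the chosen critical value."""
    return chi_square_statistic(observed_counts, expected_probs) < critical_value


history_p = [0.25, 0.25, 0.25, 0.25]    # history distribution P over 4 bins (made-up)
CRITICAL_3DF_5PCT = 7.815               # chi-square critical value, 3 degrees of freedom, alpha = 0.05

typical_window = [25, 25, 25, 25]       # bin counts of the empirical distribution P1
skewed_window = [70, 10, 10, 10]
```

With these numbers the typical window passes and the skewed window fails, so only the latter would be flagged as abnormal.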


Experiment Results of EbAT

Methods compared:
• Value I: near-optimum thresholds
• Value II: static thresholds
• Entropy I: entropy-based aggregation method I, using E1+E2+E3+E1*E2*E3
• Entropy II: entropy-based aggregation method II, using entropy of child entropies

(Figure: bar charts of Accuracy (0 to 1) and False Alarm Rate (FAR, 0 to 0.3) for Threshold I, Threshold II, Entropy I, and Entropy II.)

Average 57.4% improvement in accuracy and 59.3% reduction in false alarm rate


Experiment of Tukey and GOF

Average 48% improvement in accuracy and 50% reduction in false alarms

(Figure: bar charts of Accuracy (0 to 1) and False Alarm Rate (FPR, 0 to 0.1) comparing Relative Entropy, Tukey, and Gaussian (state of art); bars also labeled Normal, Tukey, GOF.)