System for Troubleshooting Big Data Applications in Large Scale Data Centers
Chengwei Wang
Advisor: Karsten Schwan
CERCS Lab, Georgia Institute of Technology
Collaborators
• Canturk Isci (IBM Research)
• Vanish Talwar, Krishna Viswanathan, Lakshminarayan Choudur, Parthasarathy Ranganathan, Greg MacDonald, Wade Satterfield (HP Labs)
• Mohamed Mansour (Amazon.com)
• Dani Ryan (Riot Games)
• Greg Eisenhauer, Matthew Wolf, Chad Huneycutt, Liting Hu (CERCS, Georgia Tech)
Large Scale Data Center Hardware
5 x 40 x 10 x 4 x 16 x 2 x 32 = 8,192,000 cores (8 million+ VMs)
Amazon EC2 has an estimated 454,400 (~0.5 million) servers.
Routers, switches, network topologies, ...
Large Scale Data Center Software
[Diagram: examples of data center software — web applications, 'big data' and stream data processing (e.g., Twitter Storm).]
‘Big Data’ Application
[Diagram: a multi-tier 'Big Data' application. Web logs flow into Flume agents, which forward them to Flume collectors coordinated by a Flume master; the collectors write into an HBase/HDFS storage tier (HMaster, NameNodes, and DataNodes holding data blocks); a Hadoop MapReduce cluster (a master plus Slave/TaskTracker nodes) computes page views as (PageID, # views) pairs.]
Exposed as Services in Utility Cloud
Troubleshooting War On Christmas Eve
Timeline (Dec. 24-25, 2012):
• 12:24 PM — Amazon ELB state data is accidentally deleted.
• 12:30 PM - 17:02 PM — Netflix streaming outage.
• 2:45 AM, 12/25/2012 — Amazon engineers find the root cause.
• 5:40 AM, 12/25/2012 — ELB state data is recovered to its state before the deletion.
• 8:15 AM, 12/25/2012 — The data state merge process completes. The war is over... well, forever?

• Local issue: the API is partially affected; a large number of ELB services need to be recovered.
• Global issue: ELB requests suffer high latency.
• Based on 2010 quarterly revenues, downtime could cost up to $1.75 million/hour.
Not a perfect Christmas ...
Challenges for Troubleshooting
• Dynamism: dynamic interactions/dependencies
• Large Scale: thousands to millions of entities
• Overhead: profiling/tracing information is required
• Time-Sensitive: responsive, online troubleshooting
[Diagram: end-to-end (E2E) latency is high — which component is the culprit?]
Research Components
• Modeling: Monitoring/Analytics System Design [2]
• System: VScope, Middleware for Troubleshooting Big Data Apps [1]
• Analytics: Statistical Anomaly Detection (EbAT, Tukey, Goodness-of-Fit) [3,4]; Anomaly Ranking [5]; Guidance

References:
1. VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications, Middleware'12.
2. A Flexible Architecture Integrating Monitoring and Analytics for Managing Large-Scale Data Centers, ICAC'11.
3. Statistical Techniques for Online Anomaly Detection in Data Centers, IM'11.
4. Online Detection of Utility Cloud Anomalies Using Metric Distribution, NOMS'10.
5. Ranking Anomalies in Data Centers, NOMS'12.
Research Components
• Modeling: Monitoring/Analytics System Design
• System: VScope, Middleware for Troubleshooting Big Data Apps
• Analytics: Statistical Anomaly Detection (EbAT, Tukey, Goodness-of-Fit); Anomaly Ranking; Guidance
What is VScope?
• From a systems perspective, VScope is a distributed system for monitoring and analyzing metrics in data centers.
• From a user's perspective, VScope is a tool that provides dynamic mechanisms and basic operations to facilitate troubleshooting.
Human Troubleshooting Activities
• Interaction Analysis: which collector did the problematic agent talk to? Which region servers did that collector talk to?
• Anomaly Detection: monitor agent latency and raise an alarm when latency is high. Which agents had abnormal latencies?
• Profiling & Tracing: RPC logs in region servers; debug logs in data nodes.
VScope Operations
VScope maps each troubleshooting activity to a basic operation (a hypothetical client interface is sketched below):
• Watch (anomaly detection) — continuous anomaly detection
• Scope (interaction analysis) — on-line interaction tracking
• Query (profiling & tracing) — dynamic deployment of metric collection and analytics
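To make these operations concrete, here is a minimal sketch of what a client-side interface to them might look like. The class name and method signatures are hypothetical illustrations, not VScope's actual API; the real operations are issued through VShell and the DPG layer described later.

```python
# Hypothetical client-side view of VScope's three basic operations
# (illustration only; not the real VScope API).
from typing import Callable, Dict, Iterable, List, Set


class TroubleshootingClient:
    """Sketch of the Watch / Scope / Query operations from a user's view."""

    def watch(self, metric: str, nodes: Iterable[str],
              detector: Callable[[List[float]], bool]) -> Set[str]:
        """Continuously run `detector` over `metric` on `nodes`;
        return the nodes currently flagged as anomalous."""
        raise NotImplementedError("performed by the monitoring DPG")

    def scope(self, anomalous_nodes: Iterable[str]) -> Set[str]:
        """Track interactions (e.g., connection graphs) to find the
        components the anomalous nodes depend on."""
        raise NotImplementedError("performed by interaction-tracking DPGs")

    def query(self, nodes: Iterable[str], what: str) -> Dict[str, object]:
        """Dynamically deploy metric collection or analytics (e.g., turn on
        debug logs) on a focused node set and return the results."""
        raise NotImplementedError("performed by dynamically deployed DPGs")
```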
Distributed Processing Graph (DPG)
[Diagram: VNodes arranged in a flexible topology. Metrics enter at the leaves, each VNode keeps a look-back window and computes local analysis results, and parent VNodes aggregate the monitoring data and local results into global results.]
A minimal sketch of this per-VNode behavior follows below.
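The sketch assumes a simple model in which each VNode keeps a fixed-size look-back window, runs a pluggable local analysis function over it, and a parent aggregates the children's local results into a global result. It is an illustration only, not the actual VScope/EVPath implementation, and the aggregation function is an assumed example.

```python
# Illustrative sketch of DPG nodes (not the actual VScope implementation).
from collections import deque
from statistics import mean
from typing import Callable, List


class VNode:
    """Buffers metric samples in a look-back window and runs a local
    analysis function over that window."""

    def __init__(self, window_size: int, analyze: Callable[[List[float]], float]):
        self.window = deque(maxlen=window_size)   # look-back window
        self.analyze = analyze

    def on_metric(self, value: float) -> None:
        self.window.append(value)

    def local_result(self) -> float:
        return self.analyze(list(self.window))


def aggregate(children: List[VNode]) -> float:
    """Parent-side aggregation of the children's local results into a
    global result (here a simple mean; VScope allows arbitrary functions)."""
    return mean(child.local_result() for child in children)


if __name__ == "__main__":
    # Two leaf VNodes computing average latency over a 3-sample window.
    leaves = [VNode(3, mean) for _ in range(2)]
    for sample in [0.2, 0.3, 0.4]:
        leaves[0].on_metric(sample)
        leaves[1].on_metric(sample * 2)
    print("global result:", aggregate(leaves))
```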
VScope System Architecture
[Diagram: the VMaster exposes VShell and the VScope/DPG operations, backed by a metric library and a function library. DPGManagers initiate, change, and terminate DPGs. VNodes run alongside the application components — the Flume master, agents, and collectors — in DomU guests and in Dom0 on the Xen hypervisor.]
VScope Software Stack
• Troubleshooting layer: the Watch, Scope, and Query operations, guidance, anomaly detection, and interaction tracking.
• DPG layer: DPGs plus the API and commands to manage them.
• VScope runtime.
Usecase I: Culprit Region Servers
End-to-end (E2E) performance drops from normal to low. Which tier is slow, and which servers?
• Inter-tier issue: when E2E performance is slow, was it caused by collector issues or by region server issues?
• Scale: there can be thousands of region servers!
• Interference: turning on debug-level Java logging causes high interference.
Horizontal Guidance (Across Tiers)
Iterative analysis across tiers (a sketch of this loop follows below):
1. Watch: entropy detection on the Flume agents' E2E latency; an SLA violation on latency yields the set of abnormal Flume agents.
2. Scope: using the connection graph, find the related collectors and region servers (in particular, region servers shared by the abnormal agents).
3. Query: dynamically turn on debugging and analyze timing in the RPC-level logs to obtain the processing time in the region servers.
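A self-contained sketch of this Watch → Scope → Query loop follows. Every helper below is a hypothetical stand-in with hard-coded demo data; in VScope these steps are performed by the distributed operations, not local functions.

```python
# Illustrative Watch -> Scope -> Query loop for the culprit-region-server case.
# All helpers are hypothetical stand-ins for VScope operations.
from typing import Dict, Set

SLA_LATENCY_SECONDS = 1.0  # assumed SLA threshold, for illustration only


def watch_agent_latency() -> Dict[str, float]:
    """Stand-in for Watch: current E2E latency per Flume agent."""
    return {"agent-1": 0.2, "agent-2": 3.5, "agent-3": 0.3}


def scope_connections(agents: Set[str]) -> Set[str]:
    """Stand-in for Scope: region servers reachable (via collectors)
    from the given agents, based on the connection graph."""
    graph = {"agent-2": {"collector-1"}, "collector-1": {"rs-7", "rs-9"}}
    servers: Set[str] = set()
    for agent in agents:
        for collector in graph.get(agent, set()):
            servers |= graph.get(collector, set())
    return servers


def query_rpc_timing(servers: Set[str]) -> Dict[str, float]:
    """Stand-in for Query: turn on debug logging and measure RPC
    processing time on the focused set of region servers."""
    return {s: (5.0 if s == "rs-9" else 0.1) for s in servers}


# Watch: find agents violating the latency SLA.
abnormal = {a for a, lat in watch_agent_latency().items()
            if lat > SLA_LATENCY_SECONDS}
# Scope: narrow down to the region servers those agents depend on.
suspects = scope_connections(abnormal)
# Query: profile only the suspects and rank them by processing time.
timing = query_rpc_timing(suspects)
print("likely culprit:", max(timing, key=timing.get))
```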
VScope vs. Traditional Solutions (20 region servers, one culprit server)
VScope greatly reduces interference with the application.
Usecase II: Naughty VM
A "naughty" VM (a Hadoop Slave/TaskTracker) over-consumes a shared resource due to heavy HDFS I/O, slowing down a "good" VM (a Flume agent) co-located on the same hypervisor.
• Inter-software-level issue: it is hard to find the root cause without knowing the VM-to-machine mapping.
Vertical Guidance (Across SW Levels)
[Plots (Traces 1-4 over time): Flume E2E latency (s) and network packet rates (#Mpkgs/second) measured at the good VM, the hypervisor (Dom0), and the naughty VM, showing the anomaly injected by heavy HDFS writes and its remedy via traffic shaping in Dom0.]
The troubleshooting sequence:
1. Watch: E2E latency.
2. Query: the good VM.
3. Scope/Query: the hypervisor.
4. Scope/Query: the naughty VM.
5. Remedy: the naughty VM's HDFS I/O (traffic shaping in Dom0).
VScope Performance Evaluation
• What’re the monitoring overheads?• How fast can VScope deploy a DPG?• How fast can VScope track interactions?• How well can VScope support analytics
functions?
Evaluation Setup
• Deployed VScope on the CERCS cloud (using OpenStack), hosting 1,200 Xen virtual machines (VMs). http://cloud.cercs.gatech.edu/
• Each VM has 2GB of memory and at least 10GB of disk space.
• Ubuntu Linux servers (1TB SATA disk, 48GB memory, 16 CPUs at 2.40GHz).
• Cluster connected by 1 Gigabit Ethernet.
GTStream Benchmark
[Diagram (the same architecture as the 'Big Data' application shown earlier): web logs feed Flume agents and collectors coordinated by a Flume master; data is stored in HBase/HDFS (HMaster, NameNodes, DataNodes with data blocks); a Hadoop MapReduce cluster (a master plus Slave/TaskTrackers) computes page views as (PageID, # views) pairs.]
VScope Runtime Overheads
VScope has low overheads while its DPGs perform anomaly detection and interaction tracking.
DPG Deployment
Fast DPG deployment at large scale with various topologies: balanced-tree DPGs are deployed on the VMs with different branching factors (BFs), measured against the number of VMs. A small sketch of the balanced-tree structure assumed here follows below.
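As a small illustration of the balanced-tree topologies used in this experiment, the sketch below computes how many levels a tree with branching factor d needs to cover a given number of VMs. The formula is standard tree arithmetic and the VM count used in the demo is an arbitrary example, not a result from the talk.

```python
# Levels needed for a balanced aggregation tree over n leaf VMs
# with branching factor d (simple illustration, not VScope code).
import math


def tree_levels(n_vms: int, branching_factor: int) -> int:
    """Number of levels so that branching_factor**levels >= n_vms."""
    if n_vms <= 1:
        return 0
    return math.ceil(math.log(n_vms, branching_factor))


for d in (2, 4, 8, 16):
    print(f"d={d:2d}: {tree_levels(1000, d)} levels to cover 1000 VMs")
```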
Interaction Tracking
Fast interaction tracking at large scale: tracking network connection relations between VMs, measured against the number of VMs.
Analytics Support
VScope efficiently supports a variety of analytics: deployment and computation times are measured with real analytics functions.
VScope Features
The comparison covers both debug-level on-line troubleshooting and info-level on-line monitoring, along four properties: low storage, low network overhead, low interference, and coverage.
• Brute-force (Ganglia, Nagios, Astrolabe, SDIMS): complete coverage at both levels, and low storage/network/interference for info-level monitoring, but not for debug-level troubleshooting.
• Sampling (GWP, Dapper, Fay, Chopstix): low storage and network at both levels, and low interference for info-level monitoring, but uncontrollable interference for debug-level troubleshooting and only random coverage.
• VScope: low storage and network at both levels, controllable interference for debug-level troubleshooting, low interference for info-level monitoring, and focused coverage at both levels.
VScope advantages: (1) controllable interference; (2) guided/focused troubleshooting.
Research Components
• Modeling: Monitoring/Analytics System Design
• System: VScope, Middleware for Troubleshooting Big Data Apps
• Analytics: Statistical Anomaly Detection (EbAT, Tukey, Goodness-of-Fit); Anomaly Ranking; Guidance
Monitoring/Analysis System Design Choices
• Traditional designs: centralized, balanced tree, binomial tree.
• Novel system design (using DPGs):
  > Hybrid: federating various topologies
  > Dynamic: topologies on demand
Modeling Monitoring/Analysis System Performance/Cost
• Is there a single best design choice for all scales?
• How does scale affect system design?
• How do analytics features affect system design?
• How do data center configurations affect system design?
• Is there a tradeoff between performance and cost?
Data Center Parameters
[Table: data center parameters and example values.]
*Example values are quoted from publications or obtained from micro-benchmark experiments and the experience of HP production teams.
Performance/Cost Metrics
• Performance: Time to Insight (TTI) — the latency between the time when the monitoring metrics are collected and the time when their analysis is done.
• Cost: capital cost for management — the dollar amount spent on hardware/software for monitoring and analytics.
Analytical Formulations
[Table: closed-form expressions for TTI and capital cost under centralized, hierarchical tree, binomial forest, and hybrid topologies; a rough illustrative sketch of such a model follows below.]
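The exact formulations are given in the slide's table. As a rough illustration only, under simplifying assumptions (uniform per-hop transfer time, analysis time proportional to the number of input streams at a node, fanout-d aggregation), a centralized collector and a balanced tree can be compared as sketched below. The model and its constants are hypothetical and chosen only to reproduce the qualitative effect discussed later (centralized wins at small scale, trees win at large scale); they are not the talk's actual formulas or numbers.

```python
# Toy TTI comparison of centralized vs. balanced-tree aggregation
# (illustrative model with made-up constants; not the talk's formulation).
import math

TRANSFER_PER_HOP = 0.05       # seconds to ship one level's metrics (assumed)
ANALYSIS_PER_INPUT = 0.00002  # seconds of analysis per input stream (assumed, O(N) analytics)


def tti_centralized(n_nodes: int) -> float:
    """One hop for everyone, then one analysis pass over all N inputs."""
    return TRANSFER_PER_HOP + ANALYSIS_PER_INPUT * n_nodes


def tti_tree(n_nodes: int, fanout: int) -> float:
    """Each level ships data one hop and analyzes `fanout` inputs in parallel."""
    levels = math.ceil(math.log(n_nodes, fanout))
    return levels * (TRANSFER_PER_HOP + ANALYSIS_PER_INPUT * fanout)


for n in (1_000, 100_000, 1_000_000):
    print(f"N={n:>9}: centralized={tti_centralized(n):7.3f}s, "
          f"tree(d=16)={tti_tree(n, 16):7.3f}s")
```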
Compare Topologies at Scale
• No single topology is best across all configurations.
• High performance may incur high cost.
• A hybrid design may be a good choice.
[Charts: TTI and capital cost versus scale, for analytics of O(N) and O(N^2) complexity.]
Trade-off of Performance/Cost
[Plots: TTI (seconds) and capital cost (million $) versus the number of nodes (x 10^5) for fanouts d = 2, 16, 50, 100, 200, and TTI versus the number of nodes for the Centralized, HT-Collocated, BSF, and HT-Dedicated designs.]
• The hierarchical tree with fanout 2 has the best performance (lowest TTI) but the highest cost.
• The centralized design has the best performance and the lowest cost below ~2,000 nodes, but the worst performance beyond ~6,000 nodes.
Insights
• There is no static, 'one size fits all' topology.
• Designs may trade off performance against cost.
• DPGs can provide dynamic topologies and support a variety of analytics at large scale.
• A novel, hybrid topology can yield good performance/cost.
• These are the principles we follow in VScope.
Research Components
• Modeling: Monitoring/Analytics System Design
• System: VScope, Middleware for Troubleshooting Big Data Apps
• Analytics: Statistical Anomaly Detection (EbAT, Tukey, Goodness-of-Fit); Anomaly Ranking; Guidance
Statistical Anomaly Detection
• Distribution-based anomaly detection
• Online
• Integrated into VScope
• Dynamically deployed by VScope

A Brief Summary
• Entropy-based Anomaly Tester (EbAT)
• Leveraging the Tukey method and the chi-square test
• Experiments on real-world data center traces
Conclusion
• VScope is a scalable, dynamic, lightweight middleware for troubleshooting real-time big data applications.
• We validated VScope in a large-scale cloud environment with a realistic multi-tier stream processing benchmark.
• We showcased VScope's ability to troubleshoot horizontally across tiers and vertically across software levels in two real-world use cases.
• Through analytical modeling, we conclude that dynamism, flexibility, and a performance/cost tradeoff are needed in the design of large-scale monitoring/analytics systems.
• We proposed statistical anomaly detection algorithms based on distribution changes rather than changes in individual measurements.
State of the Art: System Analytics
[Chart: existing system analytics tools positioned along two axes — scale (single host, cluster, data center, cloud/multi-tier) and complexity/online/dynamism (static to dynamic). Tools shown include sar, vmstat, top, ps, slick, console mining, regression, Hyp. HQ, Ganglia, Chukwa, G.work, Osmius, Moara, PMP, OpenView/Tivoli, Magpie, Pinpoint, Sherlock, Chopstix, Fay, GWP, Dapper, CLUE, and SIAT. The Ph.D. thesis research area sits at the dynamic, large-scale end.]
Existing work lacks systems and algorithms to support dynamic, online, complex diagnosis at large scale.
Future Work
• System Analytics
  • Large-scale complexity, a variety of workloads, and big data (system logs, application traces)
  • Cloud management (resource management, troubleshooting, migration planning, performance/cost analysis); power management; performance optimization, etc.
  • Investigating and leveraging large-scale, online machine learning and data mining for system analytics
Thanks! Questions?
Backup Slides
VScope System Architecture
[Diagram: the same VScope architecture as before (the VMaster with VShell, the metric and function libraries, and the VScope/DPG operations; DPGManagers initiating, changing, and terminating DPGs; VNodes on the Flume master, agents, collectors, and the Xen hypervisor's Dom0/DomU), extended with OpenTSDB: Time-Series Daemons (TSDs) store historical data that can be queried alongside the live DPGs.]
Why Is Dynamism Important?
We cannot afford to trace everywhere!
Distribution-based vs Value-based
• Sporadic spikes
• Patterns vs. individual measurements
EbAT (Entropy-based Anomaly Tester)
Techniques used to process the entropy time series:
• Time series analysis: exponentially weighted moving average (EWMA)
• Signal processing: wavelet analysis
• Threshold-based: visual identification; the three-sigma rule
Entropy Time Series Construction
1. Maintain a look-back window (example: a look-back window of size 3).
2. Perform data pre-processing:
   • Normalization: divide values by the mean of the samples.
   • Data binning: hash each value into one of m+1 bins.
Entropy Time Series Construction
3. M-Event creation for the look-back window: the monitoring event (M-Event) at sample s is <e_s1, e_s2, e_s3, ..., e_sn>.
4. Entropy calculation: determine the count n_i of each event e_i in the n samples; given v unique events e_i in the n samples, the (Shannon) entropy is H = -\sum_{i=1}^{v} (n_i/n) \log(n_i/n).
A compact sketch of steps 1-4 follows below.
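The sketch below walks through steps 1-4 on a single metric stream. The window size, bin count, binning scheme, and sample values are illustrative assumptions, not EbAT's actual parameters.

```python
# Sketch of EbAT entropy time-series construction (illustrative parameters).
import math
from collections import Counter, deque
from statistics import mean


def to_events(window, num_bins):
    """Pre-processing: normalize by the window mean, then hash each value
    into one of num_bins+1 bins; the binned window forms the M-Event."""
    mu = mean(window) or 1.0
    return tuple(min(int((v / mu) * num_bins), num_bins) for v in window)


def entropy(events):
    """Shannon entropy over the event counts in one look-back window."""
    counts = Counter(events)
    n = len(events)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


window = deque(maxlen=3)                              # step 1: look-back window of size 3
entropy_series = []
for sample in [10.0, 11.0, 9.0, 10.5, 95.0, 10.0]:    # 95.0 is an injected spike
    window.append(sample)
    if len(window) == window.maxlen:
        m_event = to_events(window, num_bins=10)      # steps 2-3
        entropy_series.append(entropy(m_event))       # step 4
print(entropy_series)
```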
Local and Global Entropies
• An entropy time series is created at every level of the cloud hierarchy.
• Local entropy: leaf-level entropy time series (at every VM); uses raw monitoring data as input.
• Global entropy: non-leaf-level entropy time series (aggregated entropy); uses the child entropy time series as input; it can be the entropy of the child entropies or another aggregation.
Entropy Time Series Processing
• Entropy calculation over every look-back window yields an entropy time series.
• Sharp changes in the entropy time series are tagged as anomalies (or the 3-sigma rule is used if a normal distribution is assumed); a small sketch follows below.
• Visual analysis or signal processing can be used.
[Examples: plots of entropy time series with tagged anomalies.]
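One minimal way to tag sharp changes is an EWMA baseline with a 3-sigma-style band over the entropy series, as sketched below. The smoothing factor, warm-up length, and threshold are illustrative assumptions, not EbAT's tuned values.

```python
# Tagging sharp changes in an entropy time series with an EWMA baseline
# (illustrative smoothing factor and threshold; not EbAT's tuned values).
from statistics import pstdev


def tag_anomalies(series, alpha=0.3, k=3.0, warmup=3):
    """Flag points deviating more than k standard deviations (of the history
    so far) from the exponentially weighted moving average of the series."""
    anomalies, ewma = [], series[0]
    for i, value in enumerate(series[1:], start=1):
        sigma = pstdev(series[:i]) or 1e-9         # spread of the history so far
        if i >= warmup and abs(value - ewma) > k * sigma:
            anomalies.append(i)
        ewma = alpha * value + (1 - alpha) * ewma  # update the baseline
    return anomalies


entropy_series = [0.9, 0.95, 0.92, 0.91, 2.8, 0.93]
print("anomalous indices:", tag_anomalies(entropy_series))
```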
Previous Threshold Definition
• A Gaussian/normal distribution is assumed for the data (the 68-95-99.7 rule).
• Fixed thresholds at 3σ.
[Figure: Gaussian distribution with lower and upper 3σ limits.]
Remove Distribution Assumptions
• Tukey method: no distribution assumption; applied to individual values.
• Goodness-of-fit method: no distribution assumption; tests whether the current distribution complies with the normal-behavior distribution derived from history.
Tukey Method
• Lower threshold: ltl = Q1 - k|Q3 - Q1|; upper threshold: utl = Q3 + k|Q3 - Q1|. With k = 3:
  ltl = Q1 - 3|Q3 - Q1|,  utl = Q3 + 3|Q3 - Q1|
• Possible outliers: Q3 + 1.5|Q3 - Q1| <= x_i < Q3 + 3.0|Q3 - Q1|, or Q1 - 3.0|Q3 - Q1| < x_i <= Q1 - 1.5|Q3 - Q1|.
• Observations falling beyond these limits (below ltl or above utl) are called serious outliers.
A small sketch follows below.
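The sketch below applies the Tukey classification just described. The quartile estimator (Python's statistics.quantiles) and the sample latencies are illustrative choices; in practice the quartiles would come from a history window rather than the samples being classified.

```python
# Tukey-method outlier classification (illustrative quartile estimator and data).
from statistics import quantiles


def classify_tukey(samples):
    """Label each sample as 'normal', 'possible', or 'serious' using
    Q1/Q3 and the 1.5x / 3.0x inter-quartile spread rules."""
    q1, _, q3 = quantiles(samples, n=4)          # quartiles of the data
    spread = abs(q3 - q1)
    labels = []
    for x in samples:
        if q1 - 1.5 * spread <= x <= q3 + 1.5 * spread:
            labels.append("normal")
        elif q1 - 3.0 * spread < x <= q1 - 1.5 * spread or \
                q3 + 1.5 * spread <= x < q3 + 3.0 * spread:
            labels.append("possible")
        else:
            labels.append("serious")
    return labels


latencies = [0.20, 0.22, 0.21, 0.19, 0.23, 0.60, 5.0]
print(list(zip(latencies, classify_tukey(latencies))))
```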
Goodness-of-Fit (GOF) Test
• Look-back window → empirical distribution P1; history → distribution P.
• Run a chi-square goodness-of-fit test of P1 against P: pass means normal, fail means abnormal. (A sketch follows below.)
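A minimal sketch of the chi-square test over binned metric counts, using SciPy's chisquare. The bin counts, significance level, and the way the history distribution is formed are illustrative assumptions, not the algorithm's actual configuration.

```python
# Chi-square goodness-of-fit of a look-back window against a history
# distribution (illustrative counts and significance level; requires SciPy).
from scipy.stats import chisquare


def is_abnormal(window_counts, history_fraction, alpha=0.05):
    """Compare the binned counts in the current look-back window with the
    expected counts implied by the history distribution."""
    total = sum(window_counts)
    expected = [f * total for f in history_fraction]
    _, p_value = chisquare(f_obs=window_counts, f_exp=expected)
    return p_value < alpha          # failing the test means abnormal


history = [0.70, 0.20, 0.10]                # history distribution P over 3 bins
print(is_abnormal([68, 22, 10], history))   # close to P -> False (normal)
print(is_abnormal([20, 30, 50], history))   # far from P -> True (abnormal)
```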
Experiment Results of EbAT
• Baselines — Value I: near-optimum thresholds; Value II: static thresholds.
• Entropy-based aggregation method I (Entropy I): E1 + E2 + E3 + E1*E2*E3; method II (Entropy II): entropy of the child entropies.
[Bar charts: accuracy and false alarm rate (FAR) for Threshold I, Threshold II, Entropy I, and Entropy II — on average, a 57.4% improvement in accuracy and a 59.3% reduction in false alarm rate.]
[Bar charts: accuracy and false alarm rate for Value I, Value II, Entropy I, and Entropy II — on average, a 48% improvement in accuracy and a 50% reduction in false alarms.]
Experiment of Tukey and GOF
[Bar charts: accuracy and false positive rate (FPR) for Relative Entropy, Tukey, and Gaussian (the state of the art), and accuracy and false alarm rate for Normal (Gaussian), Tukey, and GOF.]