System for Troubleshooting Big Data Applications in Large Scale Data Centers
Chengwei Wang
Advisor: Karsten Schwan
CERCS Lab, Georgia Institute of Technology
Collaborators
• Canturk Isci (IBM Research)
• Vanish Talwar, Krishna Viswanathan, Lakshminarayan Choudur, Parthasarathy Ranganathan, Greg MacDonald, Wade Satterfield (HP Labs)
• Mohamed Mansour (Amazon.com)
• Dani Ryan (Riot Games)
• Greg Eisenhauer, Matthew Wolf, Chad Huneycutt, Liting Hu (CERCS, Georgia Tech)
Large Scale Data Center Hardware
5 x 40 x 10 x 4 x 16 x 2 x 32 = 8,192,000 cores (8 million+ VMs)
Amazon EC2 has an estimated 454,400 (~0.5 million) servers.
Routers, switches, network topologies, ...
Large Scale Data Center Software
[Diagram: examples of data center software — web applications, 'big data' and stream data processing (e.g., Twitter Storm).]
‘Big Data’ Application
[Diagram: a multi-tier 'Big Data' application. Web logs flow into Flume agents, which forward them to Flume collectors coordinated by a Flume master; the collectors write into an HBase/HDFS storage tier (HMaster, NameNodes, and DataNodes holding data blocks); a Hadoop MapReduce cluster (a master plus Slave/TaskTracker nodes) computes page views as (PageID, # views) pairs.]
Exposed as Services in Utility Cloud
Troubleshooting War On Christmas Eve
Timeline (Dec. 24-25, 2012):
• 12:24 PM — Amazon ELB state data is accidentally deleted.
• 12:30 PM - 17:02 PM — Netflix streaming outage.
• 2:45 AM, 12/25/2012 — Amazon engineers find the root cause.
• 5:40 AM, 12/25/2012 — ELB state data is recovered to its state before the deletion.
• 8:15 AM, 12/25/2012 — The data state merge process completes. The war is over... well, forever?

• Local issue: the API is partially affected; a large number of ELB services need to be recovered.
• Global issue: ELB requests suffer high latency.
• Based on 2010 quarterly revenues, downtime could cost up to $1.75 million/hour.
Not a perfect Christmas ...
Challenges for Troubleshooting
• Dynamism: dynamic interactions/dependencies
• Large Scale: thousands to millions of entities
• Overhead: profiling/tracing information is required
• Time-Sensitive: responsive, online troubleshooting
[Diagram: end-to-end (E2E) latency is high — which component is the culprit?]
Research Components
• Modeling: Monitoring/Analytics System Design [2]
• System: VScope, Middleware for Troubleshooting Big Data Apps [1]
• Analytics: Statistical Anomaly Detection (EbAT, Tukey, Goodness-of-Fit) [3,4]; Anomaly Ranking [5]; Guidance

References:
1. VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications, Middleware'12.
2. A Flexible Architecture Integrating Monitoring and Analytics for Managing Large-Scale Data Centers, ICAC'11.
3. Statistical Techniques for Online Anomaly Detection in Data Centers, IM'11.
4. Online Detection of Utility Cloud Anomalies Using Metric Distribution, NOMS'10.
5. Ranking Anomalies in Data Centers, NOMS'12.
Research Components
• Modeling: Monitoring/Analytics System Design
• System: VScope, Middleware for Troubleshooting Big Data Apps
• Analytics: Statistical Anomaly Detection (EbAT, Tukey, Goodness-of-Fit); Anomaly Ranking; Guidance
What is VScope?
• From a systems perspective, VScope is a distributed system for monitoring and analyzing metrics in data centers.
• From a user's perspective, VScope is a tool that provides dynamic mechanisms and basic operations to facilitate troubleshooting.
Human Troubleshooting Activities
• Interaction Analysis: which collector did the problematic agent talk to? Which region servers did that collector talk to?
• Anomaly Detection: monitor agent latency and raise an alarm when latency is high. Which agents had abnormal latencies?
• Profiling & Tracing: RPC logs in region servers; debug logs in data nodes.
VScope Operations
VScope maps each troubleshooting activity to a basic operation (a hypothetical client interface is sketched below):
• Watch (anomaly detection) — continuous anomaly detection
• Scope (interaction analysis) — on-line interaction tracking
• Query (profiling & tracing) — dynamic deployment of metric collection and analytics
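To make these operations concrete, here is a minimal sketch of what a client-side interface to them might look like. The class name and method signatures are hypothetical illustrations, not VScope's actual API; the real operations are issued through VShell and the DPG layer described later.

```python
# Hypothetical client-side view of VScope's three basic operations
# (illustration only; not the real VScope API).
from typing import Callable, Dict, Iterable, List, Set


class TroubleshootingClient:
    """Sketch of the Watch / Scope / Query operations from a user's view."""

    def watch(self, metric: str, nodes: Iterable[str],
              detector: Callable[[List[float]], bool]) -> Set[str]:
        """Continuously run `detector` over `metric` on `nodes`;
        return the nodes currently flagged as anomalous."""
        raise NotImplementedError("performed by the monitoring DPG")

    def scope(self, anomalous_nodes: Iterable[str]) -> Set[str]:
        """Track interactions (e.g., connection graphs) to find the
        components the anomalous nodes depend on."""
        raise NotImplementedError("performed by interaction-tracking DPGs")

    def query(self, nodes: Iterable[str], what: str) -> Dict[str, object]:
        """Dynamically deploy metric collection or analytics (e.g., turn on
        debug logs) on a focused node set and return the results."""
        raise NotImplementedError("performed by dynamically deployed DPGs")
```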
Distributed Processing Graph (DPG)
[Diagram: VNodes arranged in a flexible topology. Metrics enter at the leaves, each VNode keeps a look-back window and computes local analysis results, and parent VNodes aggregate the monitoring data and local results into global results.]
A minimal sketch of this per-VNode behavior follows below.
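The sketch assumes a simple model in which each VNode keeps a fixed-size look-back window, runs a pluggable local analysis function over it, and a parent aggregates the children's local results into a global result. It is an illustration only, not the actual VScope/EVPath implementation, and the aggregation function is an assumed example.

```python
# Illustrative sketch of DPG nodes (not the actual VScope implementation).
from collections import deque
from statistics import mean
from typing import Callable, List


class VNode:
    """Buffers metric samples in a look-back window and runs a local
    analysis function over that window."""

    def __init__(self, window_size: int, analyze: Callable[[List[float]], float]):
        self.window = deque(maxlen=window_size)   # look-back window
        self.analyze = analyze

    def on_metric(self, value: float) -> None:
        self.window.append(value)

    def local_result(self) -> float:
        return self.analyze(list(self.window))


def aggregate(children: List[VNode]) -> float:
    """Parent-side aggregation of the children's local results into a
    global result (here a simple mean; VScope allows arbitrary functions)."""
    return mean(child.local_result() for child in children)


if __name__ == "__main__":
    # Two leaf VNodes computing average latency over a 3-sample window.
    leaves = [VNode(3, mean) for _ in range(2)]
    for sample in [0.2, 0.3, 0.4]:
        leaves[0].on_metric(sample)
        leaves[1].on_metric(sample * 2)
    print("global result:", aggregate(leaves))
```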
VScope System Architecture
[Diagram: the VMaster exposes VShell and the VScope/DPG operations, backed by a metric library and a function library. DPGManagers initiate, change, and terminate DPGs. VNodes run alongside the application components — the Flume master, agents, and collectors — in DomU guests and in Dom0 on the Xen hypervisor.]
VScope Software Stack
• Troubleshooting layer: the Watch, Scope, and Query operations, guidance, anomaly detection, and interaction tracking.
• DPG layer: DPGs plus the API and commands to manage them.
• VScope runtime.
Usecase I: Culprit Region Servers
End-to-end (E2E) performance drops from normal to low. Which tier is slow, and which servers?
• Inter-tier issue: when E2E performance is slow, was it caused by collector issues or by region server issues?
• Scale: there can be thousands of region servers!
• Interference: turning on debug-level Java logging causes high interference.
Horizontal Guidance (Across Tiers)
Iterative analysis across tiers (a sketch of this loop follows below):
1. Watch: entropy detection on the Flume agents' E2E latency; an SLA violation on latency yields the set of abnormal Flume agents.
2. Scope: using the connection graph, find the related collectors and region servers (in particular, region servers shared by the abnormal agents).
3. Query: dynamically turn on debugging and analyze timing in the RPC-level logs to obtain the processing time in the region servers.
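A self-contained sketch of this Watch → Scope → Query loop follows. Every helper below is a hypothetical stand-in with hard-coded demo data; in VScope these steps are performed by the distributed operations, not local functions.

```python
# Illustrative Watch -> Scope -> Query loop for the culprit-region-server case.
# All helpers are hypothetical stand-ins for VScope operations.
from typing import Dict, Set

SLA_LATENCY_SECONDS = 1.0  # assumed SLA threshold, for illustration only


def watch_agent_latency() -> Dict[str, float]:
    """Stand-in for Watch: current E2E latency per Flume agent."""
    return {"agent-1": 0.2, "agent-2": 3.5, "agent-3": 0.3}


def scope_connections(agents: Set[str]) -> Set[str]:
    """Stand-in for Scope: region servers reachable (via collectors)
    from the given agents, based on the connection graph."""
    graph = {"agent-2": {"collector-1"}, "collector-1": {"rs-7", "rs-9"}}
    servers: Set[str] = set()
    for agent in agents:
        for collector in graph.get(agent, set()):
            servers |= graph.get(collector, set())
    return servers


def query_rpc_timing(servers: Set[str]) -> Dict[str, float]:
    """Stand-in for Query: turn on debug logging and measure RPC
    processing time on the focused set of region servers."""
    return {s: (5.0 if s == "rs-9" else 0.1) for s in servers}


# Watch: find agents violating the latency SLA.
abnormal = {a for a, lat in watch_agent_latency().items()
            if lat > SLA_LATENCY_SECONDS}
# Scope: narrow down to the region servers those agents depend on.
suspects = scope_connections(abnormal)
# Query: profile only the suspects and rank them by processing time.
timing = query_rpc_timing(suspects)
print("likely culprit:", max(timing, key=timing.get))
```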
VScope vs. Traditional Solutions (20 region servers, one culprit server)
VScope greatly reduces interference with the application.
Usecase II: Naughty VM
A "naughty" VM (a Hadoop Slave/TaskTracker) over-consumes a shared resource due to heavy HDFS I/O, slowing down a "good" VM (a Flume agent) co-located on the same hypervisor.
• Inter-software-level issue: it is hard to find the root cause without knowing the VM-to-machine mapping.
Vertical Guidance (Across SW Levels)
[Plots (Traces 1-4 over time): Flume E2E latency (s) and network packet rates (#Mpkgs/second) measured at the good VM, the hypervisor (Dom0), and the naughty VM, showing the anomaly injected by heavy HDFS writes and its remedy via traffic shaping in Dom0.]
The troubleshooting sequence:
1. Watch: E2E latency.
2. Query: the good VM.
3. Scope/Query: the hypervisor.
4. Scope/Query: the naughty VM.
5. Remedy: the naughty VM's HDFS I/O (traffic shaping in Dom0).
VScope Performance Evaluation
• What’re the monitoring overheads?• How fast can VScope deploy a DPG?• How fast can VScope track interactions?• How well can VScope support analytics
functions?
Evaluation Setup
• Deployed VScope on the CERCS cloud (using OpenStack), hosting 1,200 Xen virtual machines (VMs). http://cloud.cercs.gatech.edu/
• Each VM has 2GB of memory and at least 10GB of disk space.
• Ubuntu Linux servers (1TB SATA disk, 48GB memory, 16 CPUs at 2.40GHz).
• Cluster connected by 1 Gigabit Ethernet.
GTStream Benchmark
[Diagram (the same architecture as the 'Big Data' application shown earlier): web logs feed Flume agents and collectors coordinated by a Flume master; data is stored in HBase/HDFS (HMaster, NameNodes, DataNodes with data blocks); a Hadoop MapReduce cluster (a master plus Slave/TaskTrackers) computes page views as (PageID, # views) pairs.]
VScope Runtime Overheads
VScope has low overheads while its DPGs perform anomaly detection and interaction tracking.
DPG Deployment
Fast DPG deployment at large scale with various topologies: balanced-tree DPGs are deployed on the VMs with different branching factors (BFs), measured against the number of VMs. A small sketch of the balanced-tree structure assumed here follows below.
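As a small illustration of the balanced-tree topologies used in this experiment, the sketch below computes how many levels a tree with branching factor d needs to cover a given number of VMs. The formula is standard tree arithmetic and the VM count used in the demo is an arbitrary example, not a result from the talk.

```python
# Levels needed for a balanced aggregation tree over n leaf VMs
# with branching factor d (simple illustration, not VScope code).
import math


def tree_levels(n_vms: int, branching_factor: int) -> int:
    """Number of levels so that branching_factor**levels >= n_vms."""
    if n_vms <= 1:
        return 0
    return math.ceil(math.log(n_vms, branching_factor))


for d in (2, 4, 8, 16):
    print(f"d={d:2d}: {tree_levels(1000, d)} levels to cover 1000 VMs")
```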
Interaction Tracking
Fast interaction tracking at large scale: tracking network connection relations between VMs, measured against the number of VMs.
Analytics Support
VScope efficiently supports a variety of analytics: deployment and computation times are measured with real analytics functions.
VScope Features
The comparison covers both debug-level on-line troubleshooting and info-level on-line monitoring, along four properties: low storage, low network overhead, low interference, and coverage.
• Brute-force (Ganglia, Nagios, Astrolabe, SDIMS): complete coverage at both levels, and low storage/network/interference for info-level monitoring, but not for debug-level troubleshooting.
• Sampling (GWP, Dapper, Fay, Chopstix): low storage and network at both levels, and low interference for info-level monitoring, but uncontrollable interference for debug-level troubleshooting and only random coverage.
• VScope: low storage and network at both levels, controllable interference for debug-level troubleshooting, low interference for info-level monitoring, and focused coverage at both levels.
VScope advantages: (1) controllable interference; (2) guided/focused troubleshooting.
Research Components
• Modeling: Monitoring/Analytics System Design
• System: VScope, Middleware for Troubleshooting Big Data Apps
• Analytics: Statistical Anomaly Detection (EbAT, Tukey, Goodness-of-Fit); Anomaly Ranking; Guidance
Monitoring/Analysis System Design Choices
• Traditional designs: centralized, balanced tree, binomial tree.
• Novel system design (using DPGs):
  > Hybrid: federating various topologies
  > Dynamic: topologies on demand
Modeling Monitoring/Analysis System Performance/Cost
• Is there a single best design choice for all scales?
• How does scale affect system design?
• How do analytics features affect system design?
• How do data center configurations affect system design?
• Is there a tradeoff between performance and cost?
Data Center Parameters
[Table: data center parameters and example values.]
*Example values are quoted from publications or obtained from micro-benchmark experiments and the experience of HP production teams.
Performance/Cost Metrics
• Performance: Time to Insight (TTI) — the latency between the time when the monitoring metrics are collected and the time when their analysis is done.
• Cost: capital cost for management — the dollar amount spent on hardware/software for monitoring and analytics.
Analytical Formulations
[Table: closed-form expressions for TTI and capital cost under centralized, hierarchical tree, binomial forest, and hybrid topologies; a rough illustrative sketch of such a model follows below.]
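The exact formulations are given in the slide's table. As a rough illustration only, under simplifying assumptions (uniform per-hop transfer time, analysis time proportional to the number of input streams at a node, fanout-d aggregation), a centralized collector and a balanced tree can be compared as sketched below. The model and its constants are hypothetical and chosen only to reproduce the qualitative effect discussed later (centralized wins at small scale, trees win at large scale); they are not the talk's actual formulas or numbers.

```python
# Toy TTI comparison of centralized vs. balanced-tree aggregation
# (illustrative model with made-up constants; not the talk's formulation).
import math

TRANSFER_PER_HOP = 0.05       # seconds to ship one level's metrics (assumed)
ANALYSIS_PER_INPUT = 0.00002  # seconds of analysis per input stream (assumed, O(N) analytics)


def tti_centralized(n_nodes: int) -> float:
    """One hop for everyone, then one analysis pass over all N inputs."""
    return TRANSFER_PER_HOP + ANALYSIS_PER_INPUT * n_nodes


def tti_tree(n_nodes: int, fanout: int) -> float:
    """Each level ships data one hop and analyzes `fanout` inputs in parallel."""
    levels = math.ceil(math.log(n_nodes, fanout))
    return levels * (TRANSFER_PER_HOP + ANALYSIS_PER_INPUT * fanout)


for n in (1_000, 100_000, 1_000_000):
    print(f"N={n:>9}: centralized={tti_centralized(n):7.3f}s, "
          f"tree(d=16)={tti_tree(n, 16):7.3f}s")
```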
Compare Topologies at Scale
• No single topology is best across all configurations.
• High performance may incur high cost.
• A hybrid design may be a good choice.
[Charts: TTI and capital cost versus scale, for analytics of O(N) and O(N^2) complexity.]
Trade-off of Performance/Cost
[Plots: TTI (seconds) and capital cost (million $) versus the number of nodes (x 10^5) for fanouts d = 2, 16, 50, 100, 200, and TTI versus the number of nodes for the Centralized, HT-Collocated, BSF, and HT-Dedicated designs.]
• The hierarchical tree with fanout 2 has the best performance (lowest TTI) but the highest cost.
• The centralized design has the best performance and the lowest cost below ~2,000 nodes, but the worst performance beyond ~6,000 nodes.
Insights
• There is no static, 'one size fits all' topology.
• Designs may trade off performance against cost.
• DPGs can provide dynamic topologies and support a variety of analytics at large scale.
• A novel, hybrid topology can yield good performance/cost.
• These are the principles we follow in VScope.
Research Components
• Modeling: Monitoring/Analytics System Design
• System: VScope, Middleware for Troubleshooting Big Data Apps
• Analytics: Statistical Anomaly Detection (EbAT, Tukey, Goodness-of-Fit); Anomaly Ranking; Guidance
Statistical Anomaly Detection
• Distribution-based anomaly detection
• Online
• Integrated into VScope
• Dynamically deployed by VScope

A Brief Summary
• Entropy-based Anomaly Tester (EbAT)
• Leveraging the Tukey method and the chi-square test
• Experiments on real-world data center traces
Conclusion
• VScope is a scalable, dynamic, lightweight middleware for troubleshooting real-time big data applications.
• We validated VScope in a large-scale cloud environment with a realistic multi-tier stream processing benchmark.
• We showcased VScope's ability to troubleshoot horizontally across tiers and vertically across software levels in two real-world use cases.
• Through analytical modeling, we conclude that dynamism, flexibility, and a performance/cost tradeoff are needed in the design of large-scale monitoring/analytics systems.
• We proposed statistical anomaly detection algorithms based on distribution changes rather than changes in individual measurements.
State of the Art: System Analytics
[Chart: existing system analytics tools positioned along two axes — scale (single host, cluster, data center, cloud/multi-tier) and complexity/online/dynamism (static to dynamic). Tools shown include sar, vmstat, top, ps, slick, console mining, regression, Hyp. HQ, Ganglia, Chukwa, G.work, Osmius, Moara, PMP, OpenView/Tivoli, Magpie, Pinpoint, Sherlock, Chopstix, Fay, GWP, Dapper, CLUE, and SIAT. The Ph.D. thesis research area sits at the dynamic, large-scale end.]
Existing work lacks systems and algorithms to support dynamic, online, complex diagnosis at large scale.
Future Work
• System Analytics
  • Large-scale complexity, a variety of workloads, and big data (system logs, application traces)
  • Cloud management (resource management, troubleshooting, migration planning, performance/cost analysis); power management; performance optimization, etc.
  • Investigating and leveraging large-scale, online machine learning and data mining for system analytics
Thanks! Questions?
Backup Slides
VScope System Architecture
[Diagram: the same VScope architecture as before (the VMaster with VShell, the metric and function libraries, and the VScope/DPG operations; DPGManagers initiating, changing, and terminating DPGs; VNodes on the Flume master, agents, collectors, and the Xen hypervisor's Dom0/DomU), extended with OpenTSDB: Time-Series Daemons (TSDs) store historical data that can be queried alongside the live DPGs.]
Why Is Dynamism Important?
We cannot afford to trace everywhere!
Distribution-based vs Value-based
• Sporadic spikes
• Patterns vs. individual measurements
EbAT (Entropy-based Anomaly Tester)
Techniques used to process the entropy time series:
• Time series analysis: exponentially weighted moving average (EWMA)
• Signal processing: wavelet analysis
• Threshold-based: visual identification; the three-sigma rule
Entropy Time Series Construction
1. Maintain a look-back window (example: a look-back window of size 3).
2. Perform data pre-processing:
   • Normalization: divide values by the mean of the samples.
   • Data binning: hash each value into one of m+1 bins.
Entropy Time Series Construction
3. M-Event creation for the look-back window: the monitoring event (M-Event) at sample s is <e_s1, e_s2, e_s3, ..., e_sn>.
4. Entropy calculation: determine the count n_i of each event e_i in the n samples; given v unique events e_i in the n samples, the (Shannon) entropy is H = -\sum_{i=1}^{v} (n_i/n) \log(n_i/n).
A compact sketch of steps 1-4 follows below.
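The sketch below walks through steps 1-4 on a single metric stream. The window size, bin count, binning scheme, and sample values are illustrative assumptions, not EbAT's actual parameters.

```python
# Sketch of EbAT entropy time-series construction (illustrative parameters).
import math
from collections import Counter, deque
from statistics import mean


def to_events(window, num_bins):
    """Pre-processing: normalize by the window mean, then hash each value
    into one of num_bins+1 bins; the binned window forms the M-Event."""
    mu = mean(window) or 1.0
    return tuple(min(int((v / mu) * num_bins), num_bins) for v in window)


def entropy(events):
    """Shannon entropy over the event counts in one look-back window."""
    counts = Counter(events)
    n = len(events)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


window = deque(maxlen=3)                              # step 1: look-back window of size 3
entropy_series = []
for sample in [10.0, 11.0, 9.0, 10.5, 95.0, 10.0]:    # 95.0 is an injected spike
    window.append(sample)
    if len(window) == window.maxlen:
        m_event = to_events(window, num_bins=10)      # steps 2-3
        entropy_series.append(entropy(m_event))       # step 4
print(entropy_series)
```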
Local and Global Entropies
• An entropy time series is created at every level of the cloud hierarchy.
• Local entropy: leaf-level entropy time series (at every VM); uses raw monitoring data as input.
• Global entropy: non-leaf-level entropy time series (aggregated entropy); uses the child entropy time series as input; it can be the entropy of the child entropies or another aggregation.
Entropy Time Series Processing
• Entropy calculation over every look-back window yields an entropy time series.
• Sharp changes in the entropy time series are tagged as anomalies (or the 3-sigma rule is used if a normal distribution is assumed); a small sketch follows below.
• Visual analysis or signal processing can be used.
[Examples: plots of entropy time series with tagged anomalies.]
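One minimal way to tag sharp changes is an EWMA baseline with a 3-sigma-style band over the entropy series, as sketched below. The smoothing factor, warm-up length, and threshold are illustrative assumptions, not EbAT's tuned values.

```python
# Tagging sharp changes in an entropy time series with an EWMA baseline
# (illustrative smoothing factor and threshold; not EbAT's tuned values).
from statistics import pstdev


def tag_anomalies(series, alpha=0.3, k=3.0, warmup=3):
    """Flag points deviating more than k standard deviations (of the history
    so far) from the exponentially weighted moving average of the series."""
    anomalies, ewma = [], series[0]
    for i, value in enumerate(series[1:], start=1):
        sigma = pstdev(series[:i]) or 1e-9         # spread of the history so far
        if i >= warmup and abs(value - ewma) > k * sigma:
            anomalies.append(i)
        ewma = alpha * value + (1 - alpha) * ewma  # update the baseline
    return anomalies


entropy_series = [0.9, 0.95, 0.92, 0.91, 2.8, 0.93]
print("anomalous indices:", tag_anomalies(entropy_series))
```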
Previous Threshold Definition
• A Gaussian/normal distribution is assumed for the data (the 68-95-99.7 rule).
• Fixed thresholds at 3σ.
[Figure: Gaussian distribution with lower and upper 3σ limits.]
Remove Distribution Assumptions
• Tukey method: no distribution assumption; applied to individual values.
• Goodness-of-fit method: no distribution assumption; tests whether the current distribution complies with the normal-behavior distribution derived from history.
Tukey Method
• Lower threshold: ltl = Q1 - k|Q3 - Q1|; upper threshold: utl = Q3 + k|Q3 - Q1|. With k = 3:
  ltl = Q1 - 3|Q3 - Q1|,  utl = Q3 + 3|Q3 - Q1|
• Possible outliers: Q3 + 1.5|Q3 - Q1| <= x_i < Q3 + 3.0|Q3 - Q1|, or Q1 - 3.0|Q3 - Q1| < x_i <= Q1 - 1.5|Q3 - Q1|.
• Observations falling beyond these limits (below ltl or above utl) are called serious outliers.
A small sketch follows below.
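The sketch below applies the Tukey classification just described. The quartile estimator (Python's statistics.quantiles) and the sample latencies are illustrative choices; in practice the quartiles would come from a history window rather than the samples being classified.

```python
# Tukey-method outlier classification (illustrative quartile estimator and data).
from statistics import quantiles


def classify_tukey(samples):
    """Label each sample as 'normal', 'possible', or 'serious' using
    Q1/Q3 and the 1.5x / 3.0x inter-quartile spread rules."""
    q1, _, q3 = quantiles(samples, n=4)          # quartiles of the data
    spread = abs(q3 - q1)
    labels = []
    for x in samples:
        if q1 - 1.5 * spread <= x <= q3 + 1.5 * spread:
            labels.append("normal")
        elif q1 - 3.0 * spread < x <= q1 - 1.5 * spread or \
                q3 + 1.5 * spread <= x < q3 + 3.0 * spread:
            labels.append("possible")
        else:
            labels.append("serious")
    return labels


latencies = [0.20, 0.22, 0.21, 0.19, 0.23, 0.60, 5.0]
print(list(zip(latencies, classify_tukey(latencies))))
```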
Goodness-of-Fit (GOF) Test
• Look-back window → empirical distribution P1; history → distribution P.
• Run a chi-square goodness-of-fit test of P1 against P: pass means normal, fail means abnormal. (A sketch follows below.)
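A minimal sketch of the chi-square test over binned metric counts, using SciPy's chisquare. The bin counts, significance level, and the way the history distribution is formed are illustrative assumptions, not the algorithm's actual configuration.

```python
# Chi-square goodness-of-fit of a look-back window against a history
# distribution (illustrative counts and significance level; requires SciPy).
from scipy.stats import chisquare


def is_abnormal(window_counts, history_fraction, alpha=0.05):
    """Compare the binned counts in the current look-back window with the
    expected counts implied by the history distribution."""
    total = sum(window_counts)
    expected = [f * total for f in history_fraction]
    _, p_value = chisquare(f_obs=window_counts, f_exp=expected)
    return p_value < alpha          # failing the test means abnormal


history = [0.70, 0.20, 0.10]                # history distribution P over 3 bins
print(is_abnormal([68, 22, 10], history))   # close to P -> False (normal)
print(is_abnormal([20, 30, 50], history))   # far from P -> True (abnormal)
```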
Experiment Results of EbAT
• Baselines — Value I: near-optimum thresholds; Value II: static thresholds.
• Entropy-based aggregation method I (Entropy I): E1 + E2 + E3 + E1*E2*E3; method II (Entropy II): entropy of the child entropies.
[Bar charts: accuracy and false alarm rate (FAR) for Threshold I, Threshold II, Entropy I, and Entropy II — on average, a 57.4% improvement in accuracy and a 59.3% reduction in false alarm rate.]
[Bar charts: accuracy and false alarm rate for Value I, Value II, Entropy I, and Entropy II — on average, a 48% improvement in accuracy and a 50% reduction in false alarms.]
Experiment of Tukey and GOF
[Bar charts: accuracy and false positive rate (FPR) for Relative Entropy, Tukey, and Gaussian (the state of the art), and accuracy and false alarm rate for Normal (Gaussian), Tukey, and GOF.]