Monitoring and Troubleshooting a Real Time Pipeline Alan Ngai, CTO/Co-Founder, OpsClarity

Upload: apache-apex

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Monitoring and Troubleshooting a Real Time Pipeline

Monitoring and Troubleshooting a Real Time Pipeline

Alan Ngai, CTO/Co-Founder, OpsClarity

Page 2: Monitoring and Troubleshooting a Real Time Pipeline

Businesses are Turning to Data-First Applications

Ad Network – Real-time Bidding

DDoS Attack Prevention

Fraud Detection

Internet of Things

Financial Services

Real-time Personalization

Page 3: Monitoring and Troubleshooting a Real Time Pipeline

Data-First Application: Many Moving Parts!

[Diagram labels: Data Source, Message Broker, Stream Processor, Data Sink, Applications; Data Pipeline; Elastic Infrastructure; Business Logic as Microservices Code]

Page 4: Monitoring and Troubleshooting a Real Time Pipeline

OpsClarity Runs on Data Pipelines

Real Time Topology

Real Time Health

Real Time Anomaly Detection

Page 5: Monitoring and Troubleshooting a Real Time Pipeline

Characteristics of Data Pipelines

• Heterogeneous Components

Page 6: Monitoring and Troubleshooting a Real Time Pipeline

Characteristics of Data Pipelines

• Heterogeneous Components
• Extremely Complex

[Diagram labels: Storm Master Host; Storm Worker Host; Supervisor Process; Topology; Executor; Spout Task; Bolt Task (three shown); "METRIC STORM"]

Page 7: Monitoring and Troubleshooting a Real Time Pipeline

Characteristics of Data Pipelines

• Heterogeneous Components
• Highly Complex
• Highly Interdependent

Page 8: Monitoring and Troubleshooting a Real Time Pipeline

Characteristics of Data Pipelines

• Heterogeneous Components
• Highly Interdependent
• Highly Complex
• Painful to Monitor and Debug

Page 9: Monitoring and Troubleshooting a Real Time Pipeline

Put Data In One Place (don’t rely on this)

Kafka Web Console Spark UI Marvel (Elasticsearch)

Ambari (Hadoop) Ganglia Nagios

Page 10: Monitoring and Troubleshooting a Real Time Pipeline

Organize Your Concerns Horizontally

• Throughput: stuff per unit of time
• Latency: how long it takes to process stuff
• Error Rate: how frequently bad stuff happens
• Buffered: how much stuff is piled up
• Data Loss: how much stuff is being lost
• Duplication: how much stuff is being duplicated

Matters for all stages in a pipeline! Matters for all business use cases too!
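As an illustration of the six concerns on this slide, here is a minimal Python sketch that derives them from raw counters for a single pipeline stage. It is not from the talk: the `StageWindow` fields and the `horizontal_metrics` function are hypothetical names, and the loss estimate assumes the stage started the window with an empty queue, which a real pipeline would need to track explicitly.

```python
from dataclasses import dataclass


@dataclass
class StageWindow:
    """Hypothetical counters for one pipeline stage over one time window."""
    seconds: float          # length of the window
    received: int           # records that arrived at the stage
    emitted: int            # records passed downstream
    failed: int             # records that raised an error
    backlog: int            # records still queued when the window closed
    duplicates: int         # emitted records that had been emitted before
    total_latency_s: float  # summed processing time of emitted records


def horizontal_metrics(w: StageWindow) -> dict:
    """Derive the six horizontal concerns from the raw counters."""
    # Assumption: the queue was empty at window start, so every received
    # record should be emitted, failed, or still buffered.
    accounted = w.emitted + w.failed + w.backlog
    return {
        "throughput_per_s": w.emitted / w.seconds,
        "avg_latency_s": w.total_latency_s / w.emitted if w.emitted else 0.0,
        "error_rate": w.failed / w.received if w.received else 0.0,
        "buffered": w.backlog,
        "lost": max(w.received - accounted, 0),
        "duplication_rate": w.duplicates / w.emitted if w.emitted else 0.0,
    }


if __name__ == "__main__":
    window = StageWindow(seconds=60, received=1000, emitted=900, failed=50,
                         backlog=30, duplicates=9, total_latency_s=45.0)
    for name, value in horizontal_metrics(window).items():
        print(f"{name}: {value}")
```

Because the same six numbers are computed for every stage, they can be compared across the message broker, the stream processor, and the data sink, whatever technology each one runs on.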

Page 11: Monitoring and Troubleshooting a Real Time Pipeline

Organize Your Concerns Horizontally

• Throughput
• Latency
• Error Rate
• Buffered
• Data Loss
• Duplication

Page 12: Monitoring and Troubleshooting a Real Time Pipeline

…And Also Vertically

Where to start?!?!

[Diagram labels: Storm Master Host; Storm Worker Host; Supervisor Process; Topology; Executor; Spout Task; Bolt Task (three shown); "METRIC STORM"]

Page 13: Monitoring and Troubleshooting a Real Time Pipeline

…And Also Vertically

• Data Health: throughput, latency, errors?
• Dependency Health: Are Kafka and Zookeeper healthy?
• Service Health: Is the Storm Master healthy? Are there adequate resources in the cluster?
• Application: Are my application KPIs within normal range?
• Job/Topology Health: Is my job well distributed in the cluster? Are job counters normal?
• Node Service Health: Are all jobs running on this node normal?
• Node System Health: Are key system metrics (cpu, mem, network, disk i/o) normal?
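One way to act on the vertical layers is to check them in a fixed order and report which ones are unhealthy. The Python sketch below is an assumption about how that might look, not the tooling shown in the talk: the `LAYERS` order simply follows the slide, `triage` is a hypothetical helper, and a layer with no reported status is treated as unhealthy so that gaps in monitoring surface too.

```python
# Layer names taken from the slide, in the order they are listed.
LAYERS = [
    "Data Health",
    "Dependency Health",
    "Service Health",
    "Application",
    "Job/Topology Health",
    "Node Service Health",
    "Node System Health",
]


def triage(status: dict) -> list:
    """Return the unhealthy layers in slide order.

    `status` maps a layer name to True (healthy) or False (unhealthy).
    A layer missing from `status` counts as unhealthy (an assumption
    made here so that unmonitored layers are not silently skipped).
    """
    return [layer for layer in LAYERS if not status.get(layer, False)]


if __name__ == "__main__":
    observed = {layer: True for layer in LAYERS}
    observed["Dependency Health"] = False   # e.g. Kafka or Zookeeper is down
    observed["Node System Health"] = False  # e.g. disk i/o is saturated
    print(triage(observed))
```

Keeping the order fixed gives an answer to "where to start": the first entry in the result is the first layer to investigate.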

Page 14: Monitoring and Troubleshooting a Real Time Pipeline

DEMO

Page 15: Monitoring and Troubleshooting a Real Time Pipeline

What We Talked About

• Data-First Applications Are Becoming a Thing
• Monitoring Data-First Applications is Hard!
• Get Your Metrics In One Place
• Organize Your Data Horizontally and Vertically