Data-Driven Operations - Practice Realtime Data Analysis
DESCRIPTION
Grab data from any logs and operations in real time. Gain the power to find problems instantly, and base all operations on data.
TRANSCRIPT
Data-Driven Operations
Practice realtime data analysis
@khsing
Who am I
• Currently, I am an operations architect at SINA.
• Focus on automation tools and DevOps methods.
What kind of data is useful for operations?
Before we talk about data
What does one day in ops look like?
• Check the dashboard; everything looks good.
• Start work: write scripts or configurations.
• Suddenly, receive an alert by SMS/email, or a problem reported by CS.
• Start working on the event/problem/outage.
You are the Fireman http://www.flickr.com/photos/40699207@N05/3838012090/
Find the problem
• Take a look at the dashboard, Nagios, and monitors.
• Grep logs from hundreds of hosts.
• Watch the network diagram.
• Guess what is going wrong.
Driven by problem
Passive
Be Active
Let’s talk data
Data
• Logs
• Access log, error log, exception log, step log
• Configuration Change log, Release log
• Performance Measurement
• Product operations data.
Logs
• Success is useless.
• Error is useful.
Process logs
• Realtime or near-realtime processing brings big benefits.
• You can't waste an hour when a problem really happens.
• You have to sense the problem before too many users complain.
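One way to sense a problem early is to count errors in a sliding time window as log lines arrive. The sketch below is only an illustration: the class name `ErrorWindow`, the window length, and the threshold are assumptions, not something from the talk, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class ErrorWindow:
    """Count error events seen within the last `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one error; return True when the window count reaches the threshold."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold
```

Feeding each parsed error timestamp into `add` gives a yes/no alert signal within seconds of the errors, instead of after an hour of grepping.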
Process Logs
• Categorise automatically.
Normal logs
Categorised logs
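A minimal sketch of automatic categorisation, assuming that log lines of the same kind differ only in variable parts such as numbers and IP addresses. The function names and the placeholder tokens are hypothetical; real tools use richer patterns.

```python
import re
from collections import Counter

# Order matters: IP addresses must be masked before bare numbers.
_PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def template_of(line):
    """Replace variable parts of a log line with placeholders."""
    for pattern, placeholder in _PATTERNS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def categorise(lines):
    """Group log lines by template and count each category."""
    return Counter(template_of(line) for line in lines)
```

Calling `categorise(lines).most_common(10)` then shows the ten most frequent kinds of message rather than thousands of near-identical lines.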
Performance Measurement
• How fast is the site when end users visit?
• Where do they come from?
• Which datacenter do they visit?
• What is the slow/fast user ratio?
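The slow/fast ratio can be computed per datacenter or per region from raw page-load samples. This is a sketch under assumptions: the sample shape `(group, load_ms)` and the 3000 ms threshold for "slow" are illustrative choices, not values from the talk.

```python
from collections import defaultdict


def slow_share_by_group(samples, slow_threshold_ms=3000):
    """Return the share of slow page loads for each group.

    `samples` is an iterable of (group, load_ms) pairs, where group could be
    a datacenter or a user region (hypothetical input shape).
    """
    totals = defaultdict(int)
    slow = defaultdict(int)
    for group, load_ms in samples:
        totals[group] += 1
        if load_ms > slow_threshold_ms:
            slow[group] += 1
    return {group: slow[group] / totals[group] for group in totals}
```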
Product Operations Data
• Such as DAU (daily active users).
• Drops, spikes, and increases are events that need action.
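To treat a drop or spike as an event, the current value can be compared with a recent baseline. The sketch below uses the mean of past values and a 20% tolerance; both are assumptions for illustration, and a production system would likely account for daily and weekly patterns.

```python
from statistics import mean


def classify_change(history, current, tolerance=0.2):
    """Label `current` as 'spike', 'drop', or 'normal' against the mean of `history`."""
    if not history:
        return "unknown"
    baseline = mean(history)
    if baseline == 0:
        return "spike" if current > 0 else "normal"
    change = (current - baseline) / baseline
    if change >= tolerance:
        return "spike"
    if change <= -tolerance:
        return "drop"
    return "normal"
```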
Change/Release log
• Many problems come with a change or release.
• You have to watch the data after you make a change or release.
• The change/release log has to be visible on the dashboard.
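Because many problems follow a change, a useful first question during an incident is "what changed recently?". A minimal sketch, assuming the change/release log is available as `(timestamp, description)` pairs; the function name and the one-hour lookback are hypothetical.

```python
def recent_changes(changes, incident_time, lookback_seconds=3600):
    """Return changes made in the lookback window before the incident, newest first.

    `changes` is a list of (unix_timestamp, description) pairs (assumed shape).
    """
    start = incident_time - lookback_seconds
    hits = [(ts, desc) for ts, desc in changes if start <= ts <= incident_time]
    return sorted(hits, reverse=True)
```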
Change/Release log
Be active
Don’t be defensive
–Olbrich Desouza
Attack is the best form of defence
Tools
• Splunk - commercial
• Logstash, ElasticSearch, Kibana
• Graphite
• StatsD
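As an illustration of how little code is needed to feed such tools, the sketch below formats a metric in the StatsD line format (`name:value|type`) and sends it over UDP. The metric names are made up, and the host and port assume a StatsD daemon on localhost at its default port 8125.

```python
import socket


def statsd_line(name, value, metric_type="c"):
    """Format one metric in the StatsD line format, e.g. 'c' counter, 'ms' timer, 'g' gauge."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP (fire-and-forget; host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `send_metric(statsd_line("ops.errors.500", 1))` would count one HTTP 500 error, and StatsD would forward the aggregated value to Graphite.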
Q&A