Challenges of Monitoring Distributed Systems
December 2017
Nenad Bozic @NenadBozicNs nenad.bozic@smartcat.io
SmartCat www.smartcat.io
@SmartCat_io
Agenda
● Monitoring 101
● Metric data stream and tools
● Log data stream and tools
● Combine metrics and logs for full control
● Alerting
Monitoring 101
• Monitoring domain consists of:
○ Metrics data stream
○ Log data stream
○ Alerting
Metrics Data Stream
Metric data stream
• Metrics are indicators that everything is working within expected boundaries
• Easily forgotten and pushed aside when chasing deadlines
• A good dashboard has just enough information (not too much, not too little)
Distributed system -> many graphs to watch -> information overload trap
Metric data stream - decision
• SaaS solutions vs self-managed solutions
• Paid solutions vs free solutions
• Decision based on:
○ technical team skillset
○ level of control
○ security of data
Metric data stream - stack
• Riemann as a sink that handles events and sends them to the Riemann server
• InfluxDB as a NoSQL store built for measurements
• Grafana as a visualization tool (flexible, configurable graphs from many data sources)
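To make "a store built for measurements" concrete, here is a minimal sketch of how a single metric point could be rendered in InfluxDB's line protocol (`measurement,tags fields timestamp`). The talk does not say how points are written, so the function names, the `query_latency` measurement and the tag names are all hypothetical, and the escaping covers only the common cases.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value) -> str:
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix in line protocol
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a double-quoted string field.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one point as: measurement,tag=v,... field=v,... timestamp."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys keep the output stable
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(fields[key])}"
        for key in sorted(fields)
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical point: one latency sample reported by one application node.
print(to_line_protocol(
    "query_latency",
    {"host": "node 1", "dc": "eu"},
    {"value": 12.5, "count": 3},
    1512086400000000000,
))
```

Tags (here `host` and `dc`) are what a Grafana query would typically group or filter by, which is why per-node identity goes into tags rather than into the measurement name.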
Log Data Stream
Log data stream
• Metrics indicate that something happened; logs provide the context (what happened)
• Log monitoring on a single machine requires skill and knowledge
• Same challenges as with metrics (not too much, not too little)
Distributed system -> many terminals open -> information overload trap
Log data stream - decision
• SaaS solutions vs self-managed solutions
• Paid solutions vs free solutions
• Decision based on:
○ technical team skillset
○ level of control
○ security of your data
Log data stream - ELK stack
• ELK - Elasticsearch, Logstash, Kibana
• Filebeat ships log messages from the instances
• Logstash can filter, manipulate and transform messages
• Elasticsearch indexes log messages for easier searching
• Kibana is a visualization tool with filtering capabilities
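The "filter, manipulate and transform" step can be pictured as turning a raw log line into a structured document that is then indexed. The sketch below does this in plain Python with a regular expression; it is not Logstash configuration, and the log layout, field names and the `parse_log_line` helper are assumptions made for illustration.

```python
import re

# Hypothetical log layout: "<date> <time> <LEVEL> [<component>] <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line, host):
    """Turn one raw log line into a structured document, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    doc = match.groupdict()
    doc["host"] = host  # added so entries from many instances can be told apart
    return doc


doc = parse_log_line(
    "2017-12-01 10:15:30 ERROR [QueryService] Read timeout after 500 ms",
    host="app-1",
)
print(doc["level"], doc["component"], doc["message"])
```

Once every line carries fields such as `level`, `component` and `host`, searching across many instances becomes a filter on fields instead of a grep in many terminals.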
Combine logs and metrics
Real world example
• Provide a reliable latency guarantee for 99.999% of requests
• Whole infrastructure deployed on AWS
• A lot of metrics transferred to the metrics machine
• We needed, among other things, fine-grained diagnostics for database queries at both the cluster and application level
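A guarantee for 99.999% of requests leaves very little room: out of one million requests, at most 10 may exceed the latency limit. The sketch below checks such a guarantee with the nearest-rank percentile method. The function names, the sample data and the 10 ms limit are assumptions for illustration; the talk does not state the actual limit or how it was measured.

```python
import math
from fractions import Fraction


def percentile_nearest_rank(samples, percent):
    """Nearest-rank percentile.

    `percent` is a string such as "99.999" so that the rank is computed
    exactly, without floating-point rounding.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(Fraction(percent) / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def meets_guarantee(samples, percent, limit_ms):
    """True when the given percentile of the samples is within limit_ms."""
    return percentile_nearest_rank(samples, percent) <= limit_ms


# Hypothetical data: one million requests, 10 of them slow.
latencies_ms = [5] * 999_990 + [250] * 10
print(meets_guarantee(latencies_ms, "99.999", limit_ms=10))
```

With an eleventh slow request in the same million, the check fails, which is why averages are not enough here and per-query diagnostics matter.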
Combine logs and metrics
• It is much easier to look at graphs than logs
• Good metric coverage can pinpoint the exact cause of problems
• Usually we need log messages to provide the context
• Grafana can combine InfluxDB (measurement data store) and Elasticsearch (log index)
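What a combined dashboard gives visually can be sketched as a join on time: take the moments where a metric crosses a threshold and pull the log messages recorded around them. The example below is plain Python over in-memory lists, not a Grafana, InfluxDB or Elasticsearch query; the function name, the threshold and the sample data are hypothetical.

```python
def logs_near_spikes(metric_points, log_entries, threshold, window_s=5):
    """For each metric point above `threshold`, collect log entries whose
    timestamp lies within `window_s` seconds of it.

    metric_points: list of (unix_ts, value)
    log_entries:   list of (unix_ts, message)
    """
    result = []
    for ts, value in metric_points:
        if value <= threshold:
            continue
        nearby = [msg for log_ts, msg in log_entries if abs(log_ts - ts) <= window_s]
        result.append((ts, value, nearby))
    return result


# Hypothetical latency samples (ms) and log messages, keyed by timestamp.
metrics = [(100, 12.0), (110, 480.0), (120, 15.0)]
logs = [
    (90, "GC pause"),
    (108, "Read timeout on node2"),
    (113, "Retrying query"),
    (130, "Compaction finished"),
]
for ts, value, messages in logs_near_spikes(metrics, logs, threshold=100):
    print(ts, value, messages)
```

The graph tells you when to look; the messages inside the window tell you what happened at that moment.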
Alerting
Alerting
• Alerting gives you the freedom not to watch graphs
• Someone else has encoded their domain knowledge in the alerts
• Alerts must not fire too often, or you will end up ignoring them
Distributed system -> many alerts -> information overload trap
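One common way to keep alerts infrequent, not taken from the talk itself, is to fire only when a threshold has been breached for several consecutive samples, so that a single spike does not page anyone. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def should_alert(values, threshold, consecutive=3):
    """True only if the last `consecutive` values all exceed `threshold`."""
    if len(values) < consecutive:
        return False
    return all(v > threshold for v in values[-consecutive:])


# A single spike is ignored; a sustained breach raises the alert.
print(should_alert([10, 900, 12, 11], threshold=500))
print(should_alert([10, 600, 700, 800], threshold=500))
```

The trade-off is detection delay: a larger `consecutive` value means fewer false alarms but a later alert.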
Sentinel - SMART Alerting
• Alerts are built by humans, and we make false assumptions
• Most alerting systems do not support correlation between features
• Why not let the machine find anomalies?
• Have a snapshot of the system at the moment something happened
• Have diagnostic messages with the cause of the error
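The slides do not describe how Sentinel detects anomalies, so the following is only a stand-in to show the idea of letting the machine find them: a rolling z-score that flags a value when it deviates strongly from the recent history, without a hand-written threshold per metric. The function name, window size and limit are assumptions.

```python
import statistics


def find_anomalies(values, window=5, z_limit=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` values by more than `z_limit` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mean:
                anomalies.append(i)
            continue
        if abs(values[i] - mean) / stdev > z_limit:
            anomalies.append(i)
    return anomalies


# Hypothetical latency series (ms) with one obvious outlier.
series = [10, 11, 10, 12, 11, 10, 95, 11, 10, 12]
print(find_anomalies(series))
```

A single-metric check like this still does not correlate features; it only removes the need for a human to guess the right static threshold.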
Conclusion
Takeaways
• Have the right amount of information, not too much, not too little
• Having a good selection of metrics and logs is an iterative process
• Do not end up fixing the monitoring machine instead of fixing application code
• Be proactive, not reactive
• Tailor metrics to your needs; build tools if none exist that suit your use case
Links
• Monitoring stack for distributed systems - SmartCat blog post
• Distributed logging - SmartCat blog post
• Metrics collection stack for distributed systems - SmartCat blog post
• Monitoring machine Ansible project (Riemann, Influx, Grafana, ELK) - SmartCat GitHub project
Twitter: @NenadBozicNs
Q&A
Thank you
Nenad Bozic @NenadBozicNs
SmartCat www.smartcat.io
@SmartCat_io