Analytics driven operations - Steve Acreman - Dataloop
![Page 2: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/2.jpg)
What is Dataloop?
Performance
Up / Down Alerts
Dev Env Enterprise Stuff
![Page 3: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/3.jpg)
Architecture
![Page 4: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/4.jpg)
First Year
![Page 5: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/5.jpg)
First Year
![Page 6: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/6.jpg)
Measure
![Page 7: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/7.jpg)
Putting out the fire
rollup worker
metric worker
![Page 8: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/8.jpg)
Problems
• NodeJS metrics workers not scaling
• Memory management was an issue
• Needed big caches to reduce database load
• GC cycles too long
• 8 x single processes on an 8 core server
![Page 9: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/9.jpg)
Metric worker re-write
• Approximately 6 weeks from no Erlang experience to working version
• No more crashes
• Reduced servers needed from 16 to 8
• Pushes metrics straight from Rabbit into DalmatinerDB (new database)
![Page 10: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/10.jpg)
Today
![Page 11: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/11.jpg)
Happy Ending
![Page 12: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/12.jpg)
Just the beginning!
![Page 13: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/13.jpg)
Initial Instrumentation
› StatsD libraries in Node and Erlang code
› Push UDP packets to a StatsD server for aggregation
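As a rough illustration of what "push UDP packets to a StatsD server" involves, here is a minimal Python sketch (the talk's own code was Node and Erlang, which is not shown in the slides). The metric names `api.requests` and `api.response_time` are hypothetical, and the host and port are assumed to be a local StatsD daemon on its conventional port 8125.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str) -> bytes:
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget send: UDP never confirms that anything received the packet."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names, for illustration only.
counter = statsd_line("api.requests", 1, "c")        # "c" = counter
timer = statsd_line("api.response_time", 320, "ms")  # "ms" = timer in milliseconds

send_metric(counter)
send_metric(timer)
```

The sender gets no acknowledgement, so a dropped packet or an absent server goes unnoticed by the application.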
![Page 14: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/14.jpg)
Pitfalls
› Metrics increase as service usage increases
› UDP isn’t great
› Aggregates across a service (hard to spot an outlier)
› Quite lossy
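The point about aggregates hiding outliers can be shown with a small Python sketch. The host names and response times below are made up for illustration, and the "more than 3x the median" rule is an arbitrary assumption, not something from the talk.

```python
from statistics import mean, median

# Hypothetical per-host response times (ms) for one service.
response_ms = {
    "web-1": 102.0,
    "web-2": 98.0,
    "web-3": 101.0,
    "web-4": 99.0,
    "web-5": 900.0,  # the misbehaving host
}

values = list(response_ms.values())

# A service-wide aggregate collapses five hosts into one number.
service_mean = mean(values)

# With per-host data kept, the slow host is easy to pick out.
typical = median(values)
outliers = [host for host, ms in response_ms.items() if ms > 3 * typical]

print(service_mean)  # 260.0
print(outliers)      # ['web-5']
```

A service-wide mean of 260 ms says something is slow, but not that four hosts are healthy and one is badly off.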
![Page 15: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/15.jpg)
Better Instrumentation
› Prometheus http metrics endpoints
› 10 second scrape interval into Dataloop
› Raw data (no loss)
› Dimensions allow drill down into host
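To make "dimensions" concrete, the sketch below renders samples in the Prometheus text exposition format using only the Python standard library. The metric name `http_requests_total` and the `host` and `path` labels are hypothetical; a real service would more likely use a Prometheus client library and serve this text from its metrics endpoint.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs; each label is a
    dimension that can later be used to drill down, e.g. by host.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and label values, for illustration only.
body = render_metric(
    "http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [
        ({"host": "web-1", "path": "/api"}, 42),
        ({"host": "web-2", "path": "/api"}, 17),
    ],
)
print(body)
```

Because every sample carries its own `host` label, nothing is pre-aggregated: the per-host series stay separate until query time.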
![Page 16: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/16.jpg)
Prometheus Output
curl http://localhost/metrics
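The slide's actual output is not preserved in this transcript. As a sketch of what a scraper does with such a response, the Python below parses a made-up sample in the Prometheus text format into (name, labels, value) tuples. It is deliberately simplified: it ignores timestamps and does not handle escaped quotes or `}` inside label values.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative input only; not the output shown on the original slide.
example = """\
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{host="web-1"} 42
process_open_fds 9
"""
parsed = parse_metrics(example)
print(parsed)
```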
![Page 17: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/17.jpg)
What to instrument?
› Everything!
› Feature usage
› Throughput
› Error rates
› If it moves, instrument it
![Page 18: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/18.jpg)
Analytics
› Simple things like API response times
![Page 19: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/19.jpg)
Analytics
› Pretty useful to plot when a problem started
![Page 20: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/20.jpg)
Yesterday vs. Today
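A yesterday-versus-today comparison amounts to shifting a series by 24 hours and lining it up against itself. The Python sketch below shows the idea on a plain dict of timestamp to value; the data and the function name `compare_with_yesterday` are hypothetical and do not represent Dataloop's implementation.

```python
from typing import Dict

DAY = 86_400  # seconds in 24 hours


def compare_with_yesterday(series: Dict[int, float]) -> Dict[int, float]:
    """For each point that also has a point exactly 24h earlier,
    return the difference (today minus yesterday)."""
    return {
        ts: value - series[ts - DAY]
        for ts, value in series.items()
        if ts - DAY in series
    }


# Hypothetical API response times (ms), keyed by Unix-style timestamp.
series = {
    0: 100.0,        # yesterday
    10: 110.0,       # yesterday
    86_400: 150.0,   # today, same time of day as ts=0
    86_410: 105.0,   # today, same time of day as ts=10
}
diff = compare_with_yesterday(series)
print(diff)  # {86400: 50.0, 86410: -5.0}
```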
![Page 21: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/21.jpg)
SQL Like Query Language
![Page 22: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/22.jpg)
Time Series Functions
› Create a query to answer questions
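The slides do not list which time series functions are available, so the following Python sketch shows two generic ones, a per-second rate and a moving average, purely as examples of the kind of function a query might apply. The names `rate` and `moving_average` are assumptions, not Dataloop's query syntax.

```python
def rate(points):
    """Per-second rate of change between consecutive (timestamp, value) points."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]


def moving_average(values, window):
    """Mean over a sliding window; the output is window - 1 items shorter."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical counter sampled every 10 seconds.
counter_points = [(0, 100), (10, 150), (20, 250)]
per_second = rate(counter_points)
smoothed = moving_average([1, 2, 3, 4], window=2)

print(per_second)  # [(10, 5.0), (20, 10.0)]
print(smoothed)    # [1.5, 2.5, 3.5]
```

Composing such functions answers questions like "how fast is this counter growing, once the noise is smoothed out?".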
![Page 23: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/23.jpg)
Future
› Prediction algorithms
› Search ‘similar’ metrics
› Outlier algorithms
› More functions!
![Page 24: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/24.jpg)
Summary
› Code-level metrics with Prometheus are extremely lightweight
› Have a framework in place to quickly add more when issues arise
› Don’t wait until your first fire to start
› Start small and try to get both operations and developers on board
![Page 25: Analytics driven operations - Steve Acreman - Dataloop](https://reader035.vdocuments.net/reader035/viewer/2022062311/58ee20681a28abc5088b45b9/html5/thumbnails/25.jpg)
Q&A