keeping our websites running - troubleshooting with appdynamics benoit villaumie lead architect...
Post on 17-Dec-2015
TRANSCRIPT
- Slide 1
- Keeping our websites running: troubleshooting with AppDynamics. Benoit Villaumie, Lead Architect; Guillaume Postaire, Infrastructure Manager
- Slide 2
- Introduction: Our Company. Karavel, founded 2001; #1 package travel website in France; 4 million unique visitors a month; mainly B2C, but also B2B; 15 brands, 10 white labels; one M&A every year
- Slide 3
- Our Application History. 2008, Monolithic Years: Tomcat, MySQL; expensive to maintain and scale; too big to fail. 2009, Distributed SOA: Tomcat, web services and MySQL; easier to maintain and scale; became incredibly complex to manage; design for failure
- Slide 4
- Managing this Complexity. History of architecture issues: slow SQL queries, timeouts and pool exhaustion; slow 3rd-party web services; open source framework bugs; resource and memory leaks. Long and painful firefighting: plenty of log files on multiple servers; thread and heap dumps; few JMX metrics, but never the one needed; lack of historical data
- Slide 5
- Our AppDynamics Experience: Who? Today 50+ people at Karavel use AppDynamics: product owners, developers, architects, ops
- Slide 6
- Our AppDynamics Experience: Root Causes. Memory leaks; overconsumption; performance regressions; application bugs; architectural changes; infrastructure changes
- Slide 7
- Our AppDynamics Experience: Methodology. Quickly discard wrong hypotheses (wide-spectrum investigation); investigate the interesting ones more deeply; once under control, create alerts and dashboards; communicate the methodology to the team
- Slide 8
- Common Issues
- Slide 9
- Common Issues: Response Time
- Slide 10
- Slide 11
- Slide 12
- Analyze functionality on cluster / node response time: cluster mean response time vs. node mean response time
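The cluster-versus-node comparison on this slide can be illustrated with a small sketch. It is not AppDynamics code: the function name `slow_nodes`, the input shape (response-time samples per node) and the `tolerance` factor are assumptions made for the example.

```python
from statistics import mean


def slow_nodes(node_times_ms, tolerance=1.5):
    """Return the nodes whose mean response time exceeds the cluster mean.

    node_times_ms maps a node name to its response-time samples (ms).
    tolerance is a hypothetical factor: a node is flagged when its mean
    is more than tolerance times the mean over all samples in the cluster.
    """
    all_samples = [t for samples in node_times_ms.values() for t in samples]
    if not all_samples:
        return []
    cluster_mean = mean(all_samples)
    return sorted(
        node
        for node, samples in node_times_ms.items()
        if samples and mean(samples) > tolerance * cluster_mean
    )


samples = {
    "node-a": [100, 110],
    "node-b": [90, 100],
    "node-c": [400, 420],
}
print(slow_nodes(samples))
```

If one node stands out while the others track the cluster mean, the problem is likely local to that node; if every node degrades together, a shared resource is a more plausible suspect.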
- Slide 13
- Common Issues: Response Time
- Slide 14
- Analyze functionality by Business Transaction: BT mean response time vs. mean response time of all BTs
- Slide 15
- Common Issues: Response Time
- Slide 16
- Related to a resource consumed by the application (databases, web services, ...); related to a performance regression in the implementation. Request snapshot and drill-down functionality
- Slide 17
- Common Issues: Response Time
- Slide 18
- Analyze functionality on CPU: GC Time Spent / min (ms) vs. CPU Time Spent / min (ms). CPU ms / min compared with GC CPU ms / min x100 (but this depends on your code)
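One way to read the GC-versus-CPU comparison on this slide is as a ratio check. The sketch below assumes the x100 factor means that GC time, multiplied by 100, should stay below total CPU time (roughly a 1% GC share); that reading, the function names and the default `scale` are assumptions, and the slide itself warns that the right factor depends on the code.

```python
def gc_overhead(gc_ms_per_min, cpu_ms_per_min):
    """Fraction of CPU time spent in garbage collection over one minute."""
    if cpu_ms_per_min <= 0:
        return 0.0
    return gc_ms_per_min / cpu_ms_per_min


def gc_overactive(gc_ms_per_min, cpu_ms_per_min, scale=100):
    """True when GC time, scaled by `scale`, exceeds CPU time.

    scale=100 mirrors the x100 factor on the slide (an assumed reading);
    tune it to the application's own baseline.
    """
    return gc_ms_per_min * scale > cpu_ms_per_min


print(gc_overhead(900, 30000), gc_overactive(900, 30000))
```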
- Slide 19
- Common Issues: Response Time
- Slide 20
- Related to garbage collection overactivity. /!\ Memory problem. Analyze functionality on GC Time Spent / min (ms): memory used vs. GC Time Spent / min (ms)
- Slide 21
- Common Issues: Response Time
- Slide 22
- Related to a resource leak (CPU, FD, ...); related to a selfish process that drains server resources (CPU, threads, FD). Analyze functionality, then find the class/method with a thread dump, or use ps, vmstat, top. Number of threads
- Slide 23
- Common Issues: Errors
- Slide 24
- /!\ Errors do not necessarily mean a broken user experience. Example: the weather ("meteo") is broken
- Slide 25
- Common Issues: Errors. Identify the error kind and the business transactions: Troubleshoot > Error rates, then choose the error class that has a drop in number
- Slide 26
- Common Issues: Errors. Identify the error kind and the business transactions: Troubleshoot > Error rates > details
- Slide 27
- Common Issues: Memory
- Slide 28
- Memory Problem: Monitor > Application Infrastructure > Memory
- Slide 29
- Common Issues: Memory. For a memory leak, look at Tenured Gen behavior
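A leak typically shows up as tenured-generation occupancy that keeps climbing even right after major collections. A minimal sketch of that check follows; the function name, the input (occupancy in MB sampled just after each major GC) and the `min_growth_mb` threshold are assumptions for illustration, not values from the talk.

```python
def tenured_floor_rising(post_gc_used_mb, min_growth_mb=1.0):
    """True if tenured-gen occupancy after each major GC keeps climbing.

    post_gc_used_mb lists the occupancy (MB) measured right after
    successive major collections, oldest first. At least three samples
    are required before drawing any conclusion.
    """
    if len(post_gc_used_mb) < 3:
        return False
    steps = [b - a for a, b in zip(post_gc_used_mb, post_gc_used_mb[1:])]
    return all(step >= min_growth_mb for step in steps)


print(tenured_floor_rising([400, 450, 510, 580]))
```

A strictly rising floor is a deliberately strict criterion; a real check might tolerate an occasional dip and look at the overall trend instead.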
- Slide 30
- Common Issues: Memory. Then, investigate with Object Instance Tracking
- Slide 31
- Common Issues: Memory. For memory overconsumption, look at Eden Space
- Slide 32
- Common Issues: Memory. Then, investigate with Object Instance Tracking (again)
- Slide 33
- Common Issues: Memory. But sometimes your VM simply needs more memory. Why? Ask the developers. They should know (?)
- Slide 34
- Common Issues: Backend. C process; MySQL backend
- Slide 35
- Common Issues: Backend
- Slide 36
- Slide 37
- How to monitor a legacy C socket process? Get minimal info and set alerts from the consumer process
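Measuring from the consumer side means timing each call to the backend in the calling process, since the legacy process itself cannot be instrumented. The sketch below shows the idea with a hypothetical `BackendStats` wrapper; the class, its method names and the choice to count only timeout exceptions are assumptions, not something described in the slides.

```python
import socket
import time


class BackendStats:
    """Collect call durations and timeouts for one backend, consumer side."""

    def __init__(self):
        self.durations_ms = []
        self.timeouts = 0

    def call(self, fn, *args, **kwargs):
        """Run fn, record how long it took, and count timeouts."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except (TimeoutError, socket.timeout):
            self.timeouts += 1
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.durations_ms.append(elapsed_ms)

    def mean_ms(self):
        if not self.durations_ms:
            return 0.0
        return sum(self.durations_ms) / len(self.durations_ms)

    def max_ms(self):
        return max(self.durations_ms, default=0.0)
```

The consumer would wrap its socket request in `stats.call(...)` and alert on the mean, the maximum and the timeout count, which are the three signals the following slides look at.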
- Slide 38
- Common Issues: Backend. We have a problem: mean response time
- Slide 39
- Common Issues: Backend. Max response time; mean response time. Timeouts are not normal behavior: contact the vendor
- Slide 40
- Common Issues: Backend. New version; the vendor forces us to stop monitoring; another version. Mean response time
- Slide 41
- Alerts & Dashboards
- Slide 42
- Alerts & Dashboards: proactive detection. Reduce mean time to detection. NOC Dashboard > health status on critical Business Transactions. Figure: NOC Dashboard
- Slide 43
- Alerts & Dashboards: proactive detection. Alerts (ops and devs): on response time, on errors/min, on stalls. Figure: Application Health Alerts Criteria
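The three alert criteria named on this slide (response time, errors per minute, stalls) can be expressed as a simple threshold check. This is plain Python for illustration only, not the AppDynamics health rule API; the class names and every threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MinuteMetrics:
    mean_response_ms: float
    errors_per_min: int
    stalls_per_min: int


@dataclass
class Thresholds:
    # Hypothetical limits; real values depend on each business transaction.
    max_response_ms: float = 2000.0
    max_errors_per_min: int = 10
    max_stalls_per_min: int = 0


def violations(metrics, thresholds=None):
    """Return the names of the criteria that the given minute violates."""
    t = thresholds or Thresholds()
    found = []
    if metrics.mean_response_ms > t.max_response_ms:
        found.append("response time")
    if metrics.errors_per_min > t.max_errors_per_min:
        found.append("errors/min")
    if metrics.stalls_per_min > t.max_stalls_per_min:
        found.append("stalls")
    return found


print(violations(MinuteMetrics(3000.0, 20, 1)))
```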
- Slide 44
- Alerts & Dashboards: simplify resolution, reduce mean time to resolution. Application Health Dashboard: cluster response time, node response time, node error rate, node call count. Figure: Application Health Dashboard
- Slide 45
- Alerts & Dashboards: simplify resolution, reduce mean time to resolution. Infrastructure Health Dashboard: node memory usage, node CPU usage, node thread count. Figure: Infrastructure Health Dashboard
- Slide 46
- Weekly Review. Alerting is fine, BUT some regressions may not be detected, e.g. a response time degradation spread over 4 weeks
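A slow degradation can slip under per-minute alert thresholds because no single step is large. A weekly review can catch it by comparing the first and last weekly means, as in this sketch; the function name and the 20% `total_increase` threshold are assumptions chosen for the example.

```python
def slow_drift(weekly_mean_ms, total_increase=0.20):
    """True if mean response time grew by more than total_increase.

    weekly_mean_ms lists weekly mean response times (ms), oldest first.
    Growth is measured between the first and the last week only.
    """
    if len(weekly_mean_ms) < 2 or weekly_mean_ms[0] <= 0:
        return False
    first, last = weekly_mean_ms[0], weekly_mean_ms[-1]
    return (last - first) / first > total_increase


print(slow_drift([400, 420, 445, 490]))
```

In this example each week adds only about 5 to 10 percent, which a threshold-based alert would likely ignore, while the four-week comparison shows a 22.5% increase.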
- Slide 47
- Weekly Review: our dashboard safety belt. Weekly Performance Review; Weekly Error Review (coming soon). Figure: Weekly Performance Dashboard
- Slide 48
- Capacity planning. How to ease: software tuning, hardware renewal, event planning
- Slide 49
- Capacity planning
- Slide 50
- Slide 51
- Next Steps: use workflows and automatic remediation; integrate Splunk; tag deployment events inside AppDynamics; improve knowledge sharing among customers
- Slide 52
- Questions?