Keeping our websites running - troubleshooting with AppDynamics. Benoit Villaumie, Lead Architect; Guillaume Postaire, Infrastructure Manager


Upload: magnus-owen

Post on 17-Dec-2015

TRANSCRIPT

  • Slide 1
  • Keeping our websites running - troubleshooting with AppDynamics. Benoit Villaumie, Lead Architect; Guillaume Postaire, Infrastructure Manager
  • Slide 2
  • Introduction: Our Company. Karavel, founded 2001. #1 package travel website in France. 4 million unique visitors a month. Mainly B2C, but also B2B. 15 brands, 10 white labels. One M&A every year.
  • Slide 3
  • Our Application History. 2008, monolithic years: Tomcat, MySQL; expensive to maintain & scale; too big to fail. 2009, distributed SOA: Tomcat, web services & MySQL; easier to maintain & scale; became incredibly complex to manage; design for failure.
  • Slide 4
  • Managing this Complexity. History of architecture issues: slow SQL queries, timeouts & pool exhaustion; slow 3rd-party web services; open source framework bugs; resource & memory leaks. Long and painful firefighting: plenty of log files on multiple servers; thread & heap dumps; a few JMX metrics, but never the one needed; lack of historical data.
  • Slide 5
  • Our AppDynamics Experience: Who? Today 50+ people at Karavel use AppDynamics: product owners, developers, architects, ops.
  • Slide 6
  • Our AppDynamics Experience: Root Causes. Memory leaks, overconsumption, performance regressions, application bugs, architectural changes, infrastructure changes.
  • Slide 7
  • Our AppDynamics Experience: Methodology. Quickly discard wrong hypotheses => wide-spectrum investigation. Investigate the interesting ones more deeply. Once under control, create alerts and dashboards. Communicate the methodology to the team.
  • Slide 8
  • Common Issues
  • Slide 9
  • Common Issues: Response Time
  • Slide 10
  • Slide 11
  • Slide 12
  • Analyze functionality on cluster / node response time: cluster mean response time vs. node mean response time
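The cluster-versus-node comparison on this slide can be sketched outside the tool as follows. This is a minimal illustration, not AppDynamics code: the node names, the sample values, and the 1.5x tolerance are all assumptions, and the cluster mean is taken here as the unweighted mean of the node means.

```python
from statistics import mean

def slow_nodes(node_response_ms, tolerance=1.5):
    """Return (cluster_mean, nodes whose mean exceeds tolerance * cluster_mean).

    node_response_ms maps a node name to its response-time samples in ms.
    The tolerance factor is an arbitrary illustrative choice.
    """
    node_means = {node: mean(samples) for node, samples in node_response_ms.items()}
    # Unweighted mean of node means; a real tool may weight by call count.
    cluster_mean = mean(list(node_means.values()))
    outliers = sorted(n for n, m in node_means.items() if m > tolerance * cluster_mean)
    return cluster_mean, outliers

# Hypothetical samples: node-3 is clearly slower than its peers.
samples = {
    "node-1": [120, 130, 125],
    "node-2": [118, 122, 126],
    "node-3": [480, 510, 495],
}
cluster_mean, outliers = slow_nodes(samples)
print(outliers)
```

A node that stands out against the cluster mean points to a node-local cause (host, JVM, deployment) rather than an application-wide one.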
  • Slide 13
  • Common Issues: Response Time
  • Slide 14
  • Analyze functionality by Business Transaction: BT mean response time vs. all-BT mean response time
  • Slide 15
  • Common Issues: Response Time
  • Slide 16
  • Related to a resource consumed by the application (databases, web services, ...); related to a performance regression in the implementation. Request snapshot & drill-down functionality.
  • Slide 17
  • Common Issues: Response Time
  • Slide 18
  • Analyze functionality on CPU: GC Time Spent / min (ms) vs. CPU Time Spent / min (ms). CPU ms / min; GC CPU ms / min x100 (but this depends on your code)
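One way to read the GC-versus-CPU comparison is as a ratio of the two per-minute series. The sketch below assumes that reading; the sample readings and the 10% cutoff are illustrative assumptions, and, as the slide notes, a sensible ratio depends on your code.

```python
def gc_cpu_share(gc_ms_per_min, cpu_ms_per_min):
    """Percentage of CPU time per minute that was spent in garbage collection."""
    if cpu_ms_per_min <= 0:
        return 0.0
    return 100.0 * gc_ms_per_min / cpu_ms_per_min

# Hypothetical (GC ms/min, CPU ms/min) readings for three consecutive minutes.
readings = [(300, 30000), (450, 31000), (9000, 30000)]
shares = [gc_cpu_share(gc, cpu) for gc, cpu in readings]

# Assumed cutoff: flag minutes where GC takes more than 10% of CPU time.
suspicious = [i for i, share in enumerate(shares) if share > 10.0]
print(suspicious)
```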
  • Slide 19
  • Common Issues: Response Time
  • Slide 20
  • Related to garbage collection overactivity: /!\ memory problem. Analyze functionality on GC Time Spent / min (ms): memory used vs. GC Time Spent / min (ms)
  • Slide 21
  • Common Issues: Response Time
  • Slide 22
  • Related to a resource leak (CPU, FD, ...); related to a selfish process that drains server resources (CPU, threads, FD). Analyze functionality, then the class/method found by thread dump, or ps, vmstat, top. Number of threads.
  • Slide 23
  • Common Issues: Errors
  • Slide 24
  • /!\ Errors do not necessarily mean a broken user experience: here the weather forecast (meteo) is broken
  • Slide 25
  • Common Issues: Errors. Identify the error kind and the business transactions: Troubleshoot > Error rates, then choose the error class that has a drop in number
  • Slide 26
  • Common Issues: Errors. Identify the error kind and the business transactions: Troubleshoot > Error rates > details
  • Slide 27
  • Common Issues: Memory
  • Slide 28
  • Memory problem: Monitor > Application Infrastructure > Memory
  • Slide 29
  • Common Issues: Memory. Memory leak: look at Tenured Gen behavior
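As a rough illustration of what "Tenured Gen behavior" can reveal: with a leak, the occupancy left after each full collection tends to keep climbing. The check below is a deliberately crude sketch under that assumption; the sample values, the minimum sample count, and the growth threshold are all hypothetical.

```python
def tenured_floor_is_rising(post_gc_used_mb, min_samples=4, min_growth_mb=1.0):
    """Crude leak hint: tenured occupancy after each full GC strictly increases.

    post_gc_used_mb holds tenured-generation usage (MB) measured right after
    successive full collections. Strict monotonic growth is a simplification;
    real data is noisy and needs a more tolerant trend test.
    """
    if len(post_gc_used_mb) < min_samples:
        return False
    steps = [b - a for a, b in zip(post_gc_used_mb, post_gc_used_mb[1:])]
    total_growth = post_gc_used_mb[-1] - post_gc_used_mb[0]
    return all(step > 0 for step in steps) and total_growth >= min_growth_mb

leaking = [210, 260, 315, 372, 430]   # floor climbs after every full GC
healthy = [210, 205, 212, 208, 211]   # floor oscillates around a plateau

print(tenured_floor_is_rising(leaking), tenured_floor_is_rising(healthy))
```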
  • Slide 30
  • Common Issues: Memory. Then investigate with Object Instance Tracking
  • Slide 31
  • Common Issues: Memory. Memory overconsumption: look at Eden Space
  • Slide 32
  • Common Issues: Memory. Then investigate with Object Instance Tracking (again)
  • Slide 33
  • Common Issues: Memory. But sometimes your VM simply needs more memory. Why? Ask the developers. They should know (?)
  • Slide 34
  • Common Issues: Backend. C process, MySQL backend
  • Slide 35
  • Common Issues: Backend
  • Slide 36
  • Slide 37
  • How to monitor a legacy C socket process? Get minimal info and set alerts from the consumer process
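Measuring from the consumer side means timing each call to the backend in the calling process, since the legacy process itself cannot be instrumented. The following is a generic sketch of that idea, not how AppDynamics implements backend detection; the class name, the threshold, and the fake clock used to make the demo deterministic are assumptions.

```python
import time

class BackendCallTimer:
    """Records how long each call to an opaque backend takes, in milliseconds."""

    def __init__(self, slow_threshold_ms, clock=time.perf_counter):
        self.slow_threshold_ms = slow_threshold_ms
        self._clock = clock
        self.durations_ms = []

    def call(self, func, *args, **kwargs):
        """Invoke func and record its duration, even if it raises."""
        start = self._clock()
        try:
            return func(*args, **kwargs)
        finally:
            self.durations_ms.append((self._clock() - start) * 1000.0)

    def mean_ms(self):
        return sum(self.durations_ms) / len(self.durations_ms)

    def max_ms(self):
        return max(self.durations_ms)

    def slow_calls(self):
        """Number of calls slower than the alert threshold."""
        return sum(1 for d in self.durations_ms if d > self.slow_threshold_ms)

# Demo with a fake clock so the result is reproducible: 50 ms, then 900 ms.
ticks = iter([0.0, 0.05, 1.0, 1.9])
timer = BackendCallTimer(slow_threshold_ms=500, clock=lambda: next(ticks))
timer.call(lambda: "pong")   # stands in for a socket round trip
timer.call(lambda: "pong")
print(timer.slow_calls(), round(timer.max_ms()))
```

Tracking both the mean and the max matters: a handful of timeouts barely move the mean but show up immediately in the max.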
  • Slide 38
  • Common Issues: Backend. We have a problem: mean response time
  • Slide 39
  • Common Issues: Backend. Max response time vs. mean response time. Timeouts are not normal behavior. Contact the vendor.
  • Slide 40
  • Common Issues: Backend. New version. The vendor forces us to stop monitoring. Another version. Mean response time.
  • Slide 41
  • Alerts & Dashboards
  • Slide 42
  • Alerts & Dashboards: proactive detection. Reduce Mean Time To Detection. NOC Dashboard > Health status on critical Business Transactions
  • Slide 43
  • Alerts & Dashboards: proactive detection. Alerts (ops & devs): on response time, on errors/min, on stalls. Application Health Alerts Criteria
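The three alert criteria named on this slide (response time, errors per minute, stalls) can be pictured as simple threshold checks. The sketch below is plain Python for illustration only and is not the AppDynamics health rule API; the `HealthRule` type, the metric keys, and every threshold value are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class HealthRule:
    """Hypothetical per-application thresholds for the three criteria."""
    max_response_ms: float
    max_errors_per_min: float
    max_stalls_per_min: float

def violations(rule, metrics):
    """Return the names of the criteria that the current metrics violate."""
    checks = [
        ("response_time", metrics["response_ms"] > rule.max_response_ms),
        ("errors", metrics["errors_per_min"] > rule.max_errors_per_min),
        ("stalls", metrics["stalls_per_min"] > rule.max_stalls_per_min),
    ]
    return [name for name, violated in checks if violated]

rule = HealthRule(max_response_ms=800, max_errors_per_min=5, max_stalls_per_min=0)
current = {"response_ms": 1250, "errors_per_min": 2, "stalls_per_min": 1}
alerts = violations(rule, current)
print(alerts)
```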
  • Slide 44
  • Alerts & Dashboards: simplify resolution. Reduce Mean Time To Resolution. Application Health Dashboard: cluster response time, node response time, node error rate, node call count
  • Slide 45
  • Alerts & Dashboards: simplify resolution. Reduce Mean Time To Resolution. Infrastructure Health Dashboard: node memory usage, node CPU usage, node thread count
  • Slide 46
  • Weekly Review. Alerting is fine, BUT some regressions may not be detected, e.g. a response time degradation spread over 4 weeks
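To see why alerting alone can miss this kind of regression, consider weekly means that each grow by a small step. In the sketch below the figures and both thresholds (10% week over week, 20% cumulative) are assumptions chosen for illustration.

```python
def weekly_drift(weekly_mean_ms):
    """Cumulative change, in percent, between the first and last weekly mean."""
    first, last = weekly_mean_ms[0], weekly_mean_ms[-1]
    return 100.0 * (last - first) / first

def week_over_week(weekly_mean_ms):
    """Percentage change between each pair of consecutive weeks."""
    return [100.0 * (b - a) / a for a, b in zip(weekly_mean_ms, weekly_mean_ms[1:])]

# Hypothetical mean response times (ms) for four consecutive weeks.
weeks = [200.0, 212.0, 226.0, 241.0]

steps = week_over_week(weeks)           # each step stays under a 10% alert
drift_pct = weekly_drift(weeks)         # but the total exceeds 20%
missed_by_alerts = all(step < 10.0 for step in steps)
regressed = drift_pct > 20.0
print(round(drift_pct, 1), missed_by_alerts, regressed)
```

No single week crosses the alert threshold, yet the four-week comparison shows a clear regression, which is the gap a weekly review is meant to close.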
  • Slide 47
  • Weekly Review: Our Dashboard Safety Belt. Weekly Performance Review; Weekly Error Review (coming soon). Weekly Performance Dashboard
  • Slide 48
  • Capacity planning. How to ease: software tuning, hardware renewal, event planning
  • Slide 49
  • Capacity planning
  • Slide 50
  • Slide 51
  • Next Steps. Use workflows and automatic remediation. Integrate Splunk. Tag deployment events inside AppDynamics. Improve knowledge sharing among customers.
  • Slide 52
  • Questions?