application performance troubleshooting 1x1 - part 2 - noch mehr schweine und schlangen

87
Noch mehr Schweine und Schlangen Neue Tipps zum Performance Troubleshooting Rainer Schuppe AppDynamics GmbH

Upload: rschuppe

Post on 18-Dec-2014

69 views

Category:

Technology


2 download

DESCRIPTION

Application Performance doesn't come easy. How to find the root cause of performance issues in modern and complex applications? All you have is a complaining user to start with? In this presentation (mainly in German, but understandable for english speakers) I'd reprised the fundamentals of trouble shooting and have some new examples on how to tackle issues. Follow up presentation to "Performance Trouble Shooting 101 - Schweine, Schlangen und Papierschnitte"

TRANSCRIPT

Page 1: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Noch mehr Schweine und Schlangen

Neue Tipps zum Performance Troubleshooting

Rainer SchuppeAppDynamics GmbH

Page 2: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Rainer

+Rainer Schuppe

Customer SupportSystem Support / OpsConsultant / DevSolution ArchitectSales Engineer

Page 3: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Where to start? What to do? Who to blame?

Tooling

Usecases - Symptoms & Diagnostics

Reprise: Why care about performance

Page 4: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen
Page 5: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

failure everyday

Page 6: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

BIG DATA

Hadoop Cassandra MongoDB

Coherence

Memcached

CLOUD

Amazon EC2 Windows Azure

VMWare

Login Search Flight

View Flight Status Make Reservation

Weblogic Oracle

.NET

MQ

ATG, Vignette, Sharepoint

SQL Server

JBoss

Tomcat

Tomcat

Mule, Tibco, AG

ESB

.NET

Tomcat

SOA

WEB 2.0

Browser Logic AJAX Web Frameworks

Release 3.4 Release 3.5 Release 3.6 Release 4.0

AGILE

Release 1.1 Release 1.2 Release 1.23 Release 1.5

Release 4.4 Release 4.5 Release 4.6 Release 5.0

Release 2.4 Release 2.5 Release 2.6 Release 3.0

Release 1.4 Release 1.5 Release 1.6 Release 2.0

Release 1.4 Release 1.5 Release 1.6 Release 2.0

Complexity increases

Page 7: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Generic Troubleshooting Process

DiagnosisTriageAlert / Detection

Fix Solution Finding

Rootcause Detection

Move on with life

Data / Information

Page 8: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Determine who needs to fix it

• Starts with overview and comparison to „normal“ performance

• First level task (Operators)

• First indication of problem type

• Works best with transactional data

Triage

Page 9: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

BIG DATA

Hadoop Cassandra MongoDB

Coherence

Memcached

CLOUD

Amazon EC2 Windows Azure

VMWare

Login Search Flight

View Flight Status Make Reservation

Weblogic Oracle

.NET

MQ

ATG, Vignette, Sharepoint

SQL Server

JBoss

Tomcat

Tomcat

Mule, Tibco, AG

ESB

.NET

Tomcat

SOA

WEB 2.0

Browser Logic AJAX Web Frameworks

Release 3.4 Release 3.5 Release 3.6 Release 4.0

AGILE

Release 1.1 Release 1.2 Release 1.23 Release 1.5

Release 4.4 Release 4.5 Release 4.6 Release 5.0

Release 2.4 Release 2.5 Release 2.6 Release 3.0

Release 1.4 Release 1.5 Release 1.6 Release 2.0

Release 1.4 Release 1.5 Release 1.6 Release 2.0

100 ms

50 ms

45,3 ms50 ms

10 ms60 ms

150 ms160 ms

145 ms

145 ms145 ms

145 ms145 ms

10 ms

250 ms

300 ms300 ms

310 ms

15 ms

1 ms 250 ms

Page 10: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

BIG DATA

Hadoop Cassandra MongoDB

Coherence

Memcached

CLOUD

Amazon EC2 Windows Azure

VMWare

Login Search Flight

View Flight Status Make Reservation

Weblogic Oracle

.NET

MQ

ATG, Vignette, Sharepoint

SQL Server

JBoss

Tomcat

Tomcat

Mule, Tibco, AG

ESB

.NET

Tomcat

SOA

WEB 2.0

Browser Logic AJAX Web Frameworks

Release 3.4 Release 3.5 Release 3.6 Release 4.0

AGILE

Release 1.1 Release 1.2 Release 1.23 Release 1.5

Release 4.4 Release 4.5 Release 4.6 Release 5.0

Release 2.4 Release 2.5 Release 2.6 Release 3.0

Release 1.4 Release 1.5 Release 1.6 Release 2.0

Release 1.4 Release 1.5 Release 1.6 Release 2.0

�Problem

Page 11: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Key:

= bad

= not bad

Page 12: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Determine the root of the problem

• Uses first level information to narrow scope

• Needs specialists

• Lots of data / information needed in real time and historical

• Usually needs iterations

• More than 1 tool used in the process

Diagnose

Page 13: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Confirm the rootcause after you diagnosed it

• Document it

• Recreate it in test if possible

• Needs the same data as diagnostics

Rootcause detection

Page 14: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Find a solution for the problem

• Architect a workaround or a fix

• Again needs the diagnostic data

• Run some test runs with different options - check them in realtime

• Confirm the idea for the fix

• May be a different team then the trouble shooters

Solution finding

Page 15: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Intuition

• Experience

• Tools

• Logfiles

• Communication

How to get the data?

Page 16: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

© val-j - sxc.hu

Tooling

Page 17: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Concurrency Data Volume Resource

3 Key Things Impact Performance & Availability

Page 18: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Development

Data Volume ResourceConcurrency

QA/Test

Data Volume ResourceConcurrency

Production

Data Volume ResourceConcurrency

Why do things crash and slow down?

Page 19: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

LoggingARMBytecode Instrumentation / AspectsSamplingJMX (Java Management Extensions)PMI (IBM WebSphere specific)

Technologies DevTestProd

Page 20: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Pros:

• Anything can be logged

• Easy to implement (if you have the sourcecode)Cons:

• Only what the developer thinks is needed

• I/O heavy

• No chance for change if you don‘t own the source code

• Lots of files - no TX context usually

• How to correlate in distributed environment?

Logfiles DevTestProd

Page 21: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Logging can be the source of problems itselfe.g. Log4Net

• Synchronous local file system access

• The more you log the longer it takes

• Can only be diagnosed with another tool

Logfiles - 2 DevTestProd

Page 22: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Bringing down production with Logging

Microsoft.Win32.Win32Native:SetFilePointerWin32

Page 23: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Logfiles - 4[#|2013-04-16T16:04:44.319+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|

_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Store timer|#]

[#|2013-04-16T16:04:44.335+0200|INFO|sun-appserver2.1|com.appdynamics.TOP.SUMMARY.STATS.WRITE|_ThreadID=14;_ThreadName=pool-1-thread-9;|START TIME for timer service(TopSummaryStatsWriterTimerTaskBean) will be: Tue

Apr 16 16:05:00 CEST 2013|#]

[#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Store timer|#]

[#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Purger timer|#]

[#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Purger timer|#]

[#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Detail String cache timer|#]

[#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Detail String cache timer|#]

[#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats rollup timer|#]

Page 24: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Pros:

• No config needed

• Lots of data - lots of detailCons:

• Lots of data - not suitable for production

• Needs experience

• No transactional concept / context

Profiler DevTest

Page 25: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Profiler

Page 26: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Pros:

• Built into most application servers

• JConsole is part of the JDK

• Easy to implement MBeansCons:

• No transaction context

• Not available for 3rd party

• No historical data

• Usually one JVM only

JMX (and similar) DevTestProd

Page 27: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

JMX (and similar)

Page 28: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Pros:

• They are free

• Transaction context (most of them)

• Quick setup (the commercial ones)Cons:

• Usually functionally constrained (commercial)

• Hard to configure (open source)

• Usually no history

APM tools (free) DevTestProd

Page 29: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Pros:

• Transactions, Historical data

• Distributed monitoring

• Deep dive diagnostics

• Production fitCons:

• Costly

• Choose the right one

APM tools (commercial) DevTestProd

Page 30: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

There are just 2 sorts of issues

Diagnosis

Page 31: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

© NLTeddy - sxc.hu

Page 32: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

© ross666 - sxc.hu

Page 33: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Constantly slow (Turtle)• Slowly, but constantly slower• Exponentially slower• Suddenly slower • Sporadically slow• Spontaneous crash

50 shades of slow (appx.)

Page 34: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Sudden outage• Always erroneous• Sporadically Errormessages• Silent death / Bleed to death• Increasing errorrates• Wrong / meaningless error messages

The wonderful world of errors

Page 35: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Look at symptoms• Eliminate definite non-causes• Prioritize the suspicions• Confirm suspicion / Eliminate suspicion

• Compare with „normal“• Gather more information• Define root cause and confirm it• Redo from Start

Diagnosis – Rough Flow

Page 36: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Bad Coding• Too much load• Backend not reachable / slow• Conflicting resources• Memory Leak• Resource Leak• Network / Hardware Problem

Possible Causes(in no particular order)

Page 37: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Example

• Resource Contention

• Exceptions

• Load Issues

Page 38: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Symptoms

• User complaints - Slow performance

• Exceptions appeared in logfiles

• Alerts for ops triggered

Page 39: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

0

2.750

5.500

8.250

11.000

time 10:01 10:03 10:05 10:07 10:09 10:11 10:13 10:15 10:17 10:19 10:21 10:23 10:25 10:27 10:29

Average Response Time (ms)

Connection Timeout

Page 40: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

0

3,75

7,50

11,25

15,00

10:00 10:02 10:04 10:06 10:08 10:10 10:12 10:14 10:16 10:18 10:20 10:22 10:24 10:26 10:28 10:30

Connection Pool vs. Errors

org.hibernate.util.JDBCExceptionReporter : Cannot get a connection, pool error Timeout waiting for idle object

Page 41: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

1st Diagnosis

• OK - We do have a problem

• Database connection pool depleted

• Waiting times stacking

• 10 minutes until errors appear in logs

• But WHY?

Page 42: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Open Questions

• Which database?

• Which DB Pool?

• Transaction specific?

• Problem on DB?

• What is the load?

Page 43: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

How to find data

• Check log for DB connection info

• Ask architect which TX are using this pool

• Use JMX to check pool metrics

• Check load info (if available)

Page 44: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen
Page 45: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen
Page 46: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Who else is using the DB?

Ask your architect!

Page 47: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Who else is using the DB?Ask your tool!

Page 48: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

What did we find out?

• There were other TX‘s using the DB

• It was just a single transaction with the DB problem

• This TX had a specific DB connection pool

OK - Now let‘s check the load

Page 49: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Trx/min

Avg RTPool LimitPool Usage

Trx Stalls

Load and DB connections

Why the sudden load increase?

Page 50: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Root Cause

• Loadbalancer was not working correctly

• DB connection pool size was not appropriate for this load

• Many different pools made this config necessary

Page 51: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

BIG DATA

Hadoop Cassandra MongoDB

Coherence

Memcached

CLOUD

Amazon EC2 Windows Azure

VMWare

Login Search Flight

View Flight Status Make Reservation

Weblogic Oracle

.NET

MQ

ATG, Vignette, Sharepoint

SQL Server

JBoss

Tomcat

Tomcat

Mule, Tibco, AG

ESB

.NET

Tomcat

SOA

WEB 2.0

Browser Logic AJAX Web Frameworks

Release 3.4 Release 3.5 Release 3.6 Release 4.0

AGILE

Release 1.1 Release 1.2 Release 1.23 Release 1.5

Release 4.4 Release 4.5 Release 4.6 Release 5.0

Release 2.4 Release 2.5 Release 2.6 Release 3.0

Release 1.4 Release 1.5 Release 1.6 Release 2.0

Release 1.4 Release 1.5 Release 1.6 Release 2.0

PurchaseSearch FlightFlight Status

Login

Browser(s)

NetworkNativeMobile

App

MOBILE

The missing link

Page 52: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Enduser Monitoring helps

Page 53: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Complete the picture

Page 54: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

A quick look at memory

and resources

Page 55: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Linear Memory Leak• Symptoms:

• OOM (Out of memory error)• Slow over time with spikes• Sawtooth with upward trend

• Causes• Objects added to linear structures without being removed

(e.g., linked lists)• Other API misuse (addListener() without corresponding

removeListener(), etc.)

Page 56: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Aggregate detection: • linear growth in heap utilization• GC time growth

• Specific detection:• Figure out object types being leaked• Verbose GC• Find related APIs and search code for misuse

Linear Memory Leak

Page 57: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Challenges• References - many small objects are referenced in one

collection• Death by 1000 cuts (Papierschnitte)

• Specific detection:• Figure out object types being leaked• Verbose GC• Find related APIs and search code for misuse

Linear Memory Leak

Page 58: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Heap Dump Comparison• Needs at least 2 dumps• Stops the JVM• Can take several minutes each• Creates tons of data• Finds the object, not the code responsible for the leak

• Profiler• High overhead - not for production• Lots of data

• APM Solution• Collection based algorithm – finds only collection leaks• Instance counting• Trade off between low overhead and usefulness of data

Specific detection

Page 59: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen
Page 60: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen
Page 61: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen
Page 62: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen
Page 63: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes:• Objects added to most data structures

without being removed (e.g., vectors, hashtables)

• Other API misuse (as Linear Leak)• Aggregate detection:

• exponential growth in heap• Specific detection:

• Same as Linear Leak

Exponential Memory Leak

Page 64: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes:• API misuse of Java objects with resource-

style lifecycle (create->use->destroy)• Aggregate detection:

• Slow over time• Growth in heap (if you’re lucky)

• Specific detection:• Audit code for API misuses• Object instance tracking

Resource Leak

Page 65: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Resource conflict (block / wait)

Page 66: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes:• Overcautious data integrity strategy• Synchronising is always good

• Aggregate detection: • Stalled threads• High thread usage - low CPU usage

• Specific detection:• Thread dumps as needed• Stack traces / graphs• CPU block / wait timing measurement

Resource conflict / blocking

Page 67: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Fragen ?

Page 68: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes:• Infinite loop in code

• Aggregate detection: • Stalled threads• Permanently high usage of CPU / threads

• Specific detection:• Thread dumps as needed• Stack traces / graphs

Bad Coding: Infinite Loop

Page 69: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes:• Idiot with a “Learn Java in 24 Hours” book

• Aggregate Detection: • Response time measurement• Aggregate CPU utilization

• Specific Detection: • Detailed CPU utilization

• Typical Cure:• Cache of data or of performed calculations

Bad Coding: CPU-Bound Component

Page 70: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes:• Poorly implemented data bridge layer, or simply

too many of them• DB -> XML -> XSLT -> More XML -> “Custom

Data Management Layer” -> Consumer

• Aggregate Detection: • Response time measurements

• Specific Detection: • Call graphs - Call trace (stack trace not

enough)• Ask for a design or architecture document

Layer-itis

Page 71: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes:• Hibernate fixes everything• Massive SQL statements (length and amount)• Wrong data strategy

• Aggregate Detection: • Response time measurements• DB time measurements

• Specific Detection: • Call stacks / snapshots

O/R Mapper misuse

Page 72: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Caching issues

Page 73: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes: • Continual attempts to call backend +

unavailable backend• Aggregate Detection / Specific Detection:

• Response time measurement• Backend detection - measurement (time

& # of calls)• Stalled TX count• Exceptions • Busy thread count

The Unending Retry

Page 74: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

don’t forget about thrown exceptions

Page 75: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes: • Fundamental error in threading / lock

acquisition strategy• Aggregate Detection:

• Stalled threads / permanently high concurrent usage

• Specific Detection: • Deadlock detection in JVM• Thread dumps• Busy thread count

Threading: Deadlock / Livelock

Page 76: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Found one Java-level deadlock:============================="Thread-2":  waiting to lock monitor 102054308 (object 7f3113800, a java.lang.Object),  which is held by "Thread-1""Thread-1":  waiting to lock monitor 1020348b8 (object 7f3113810, a java.lang.Object),  which is held by "Thread-2" Java stack information for the threads listed above:==================================================="Thread-2":    at DeadlockTest$2.run(DeadlockTest.java:42)    - waiting to lock <7f3113800> (a java.lang.Object)    - locked <7f3113810> (a java.lang.Object)    at java.lang.Thread.run(Thread.java:680)"Thread-1":    at DeadlockTest$1.run(DeadlockTest.java:26)    - waiting to lock <7f3113810> (a java.lang.Object)    - locked <7f3113800> (a java.lang.Object)    at java.lang.Thread.run(Thread.java:680) 

Threading: Deadlock

Page 77: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes: • Many threads bottlenecked waiting for

one lock• Aggregate Detection:

• Stalled threads / high concurrent usage• Exponential slowness• Low CPU usage

• Specific Detection: • Request response time monitoring• CPU block / wait timing

Threading: Chokepoint

Page 78: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Threading: Chokepoint

Page 79: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes: • Overusage of internal resource (threads,

database connections, etc.)• Underallocation of same

• Aggregate Detection:• Stalled threads / high concurrent usage• Call rate and average response time of internal

resource• Specific Detection:

• Also compare with methods from Resource Leak, External Bottleneck, and Overusage of External System

Internal Resource Bottleneck

Page 80: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes: • External system (database, authentication server) is

slow• Compare with Overusage of external system

• Aggregate Detection:• Response time on backend calls• Exceptions

• Specific Detection: • Callgraphs• Specific monitoring on those backends

External Bottleneck

Page 81: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Commit happy

Page 82: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Trx/min

Avg RTPool LimitPool Usage

Trx Stalls

Production Ground to a halt for 2 hours And again the next day

Page 83: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• Causes: • Poor design or tuning of interaction with backend system

(e.g., join between two million-row tables for each user logon)

• O/R mapper misconfiguration• Aggregate Detection:

• Response time measurement• Specific Detection:

• Timing on backend systems• Also need tools for those backend systems

Overusage of External System

Page 84: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

excessive database access

Page 85: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

query too much data

Page 86: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

Copyright © 2013 AppDynamics. All rights reserved.

Applications will become more Complex and Change Faster

86

PurchaseSearch FlightFlight Status

Login

Browser(s)

Network

Oracle.NET Service

NativeMobile

App

MOBILE

CDN

WEB 2.0

AGILERelease 1.1Release 1.2Release 1.23Release 1.5

MonolithicDistributed

ApacheESB/MQ

JBoss Service

WebLogic Service

PHP Service

3rd Party Web Service

SOA

Tomcat Service

VMWarePrivate

JBoss Service

Amazon EC2Public

Hadoop

CLOUD

Sybase

MySQL

SQL Server

PostgreSQL

BIG DATA

Cassandra

Memcached

NOSQL

DB2

Page 87: Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

• One interesting problem occurs when the size of transactions with backend systems needs to be tuned

• Can be intertwined with / exacerbated by Layer-itis and Overusage of External System

Many small requests

System constantly wastes resources

dispatching / unmarshalling many xactions and results

“Death by a thousand cuts”

One HUGE request

System periodically slows to a crawl as many resources get

thrown at large chunk of work

“Pig in a Python”

“Just Right”