Java GC - pause tuning
DESCRIPTION
English version of the presentation we gave at Devoxx FR 2012. In-depth analysis of how the Java garbage collector works and how to minimise pauses in your application.
TRANSCRIPT
Everything you ever wanted to know about GC pauses*
*but were afraid to ask
1
Death by pauses
Tuesday, July 10, 12
Agenda
1. Introduction
2. Crime Scene Investigation
3. JVM Memory management systems and tools
4. Putting it together
2
The Crime Scene
PG-13*
* Parents strongly cautioned: typed language, dead objects and verbose logs may not be suitable for scripting language fans
3
4
B2C e-commerce platform
[Diagram: Apache, Tomcat, Oracle]
• 12+ servers
• 10 different webapps
• 50+ JVMs (Oracle JDK6)
• > 30000 sessions
• 250-400 req/s
• Variance is high
... an unusual victim...
5
Product catalog modeled as a graph
• 100% custom implementation
• 100% on-heap (no SQL except for initial load)
• In-place update by AtomicReference.set()
Caching aggressively is not possible
• Large number of request-scoped objects
• Many WS calls into back-office systems = latency
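A minimal sketch of what an in-place update through AtomicReference.set() could look like for one node of an on-heap graph. The class and field names (CatalogNode, ProductData) are hypothetical and are not taken from the platform described in the talk.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CatalogNodeSketch {
    /** Immutable snapshot of a product's data (hypothetical shape). */
    static final class ProductData {
        final String name;
        final long priceCents;

        ProductData(String name, long priceCents) {
            this.name = name;
            this.priceCents = priceCents;
        }
    }

    /** A graph node whose payload is swapped atomically while readers keep running. */
    static final class CatalogNode {
        private final AtomicReference<ProductData> data;

        CatalogNode(ProductData initial) {
            this.data = new AtomicReference<>(initial);
        }

        ProductData read() {
            return data.get();
        }

        void update(ProductData fresh) {
            // The previous snapshot becomes unreachable once no reader holds it.
            data.set(fresh);
        }
    }

    public static void main(String[] args) {
        CatalogNode node = new CatalogNode(new ProductData("widget", 1999));
        node.update(new ProductData("widget", 1799));
        System.out.println(node.read().name + " " + node.read().priceCents);
    }
}
```

Each update leaves the previous snapshot as garbage. If that snapshot had already been promoted to the old generation, only an old-generation collection can reclaim it.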
6
Throughput vs. Latency
7
Interactive e-commerce app: low latency is the top priority!
The Crime Scene
8
[Charts over time: JDBC connections, requests/s, active threads, HTTP executor queue size]
The evidence
9
[Chart: heap size in MB over 1 hour]
Can’t see anything: let’s zoom out!
10
[Chart: heap size in MB over 24 hours]
11
[Charts: time spent in GC (%) over 1 hour, axis 0-100; heap size in MB over 1 hour]
The usual suspects...
12
• OutOfMemory Heap
• OutOfMemory PermGen
• Long GC pauses
➡ under high load = immediate death
Death by pauses
Why do we need this GC thing again?
13
“Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free.”
Cliff Click
Fine, we just need to tune the JVM, right?...
14
POP QUIZ! Number of command-line flags*?
* Oracle JVM 1.6.0_31 x86_64 server
• less than 100 flags
• 100 <= X < 200
• 200 <= X < 300
• 300 <= X < 400
• 400 <= X < 500
• 500 <= X < 600
• 600 <= X < 700
664 flags!
Memory in the JVM
16
[Diagram: JVM]
17
Permanent (PermGen): class metadata, interned Strings, etc.
18
Heap: application objects
Permanent (PermGen): class metadata, interned Strings, etc.
19
Heap = Young / New + Old / Tenured
Permanent (PermGen)
20
Young = Eden + S0 + S1
Old / Tenured
Permanent (PermGen)
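The memory areas listed on these slides can be inspected from inside a running JVM through the standard java.lang.management API. This is a minimal sketch; the class name MemoryPoolsSketch is ours, and the pool names it prints depend on the JVM version and the selected collector (for example, PermGen no longer exists on Java 8 and later, where Metaspace replaces it).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class MemoryPoolsSketch {
    /** Returns one line per memory pool: name, type (heap or non-heap) and current usage. */
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            String used = usage == null ? "n/a" : (usage.getUsed() / 1024) + "K";
            // getMax() returns -1 when no maximum is defined for the pool.
            String max = usage == null || usage.getMax() < 0 ? "undefined" : (usage.getMax() / 1024) + "K";
            lines.add(pool.getName() + " [" + pool.getType() + "] used=" + used + " max=" + max);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```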
The Garbage Collector is generational
21-43
[Animation: Eden, Survivor 0, Survivor 1 and Old spaces]
1. Allocation: new objects are allocated in Eden.
2. Eden 100% full = GC!
3. Objects are either live or unreferenced.
4. Copy: live objects are copied from Eden to a survivor space.
5. Reset: Eden is emptied.
6. Allocation resumes until Eden is 100% full again = GC!
7. Copy: live objects from Eden and from the occupied survivor space are copied to the other survivor space.
8. Reset: the survivor space now holds objects of generation 1 and generation 2.
9. Allocation, 100% = GC!, copy... the cycle repeats.
10. Promotion: objects that have survived long enough are promoted to Old.
11. Old “almost full” = GC!
12. Compaction (optional)
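The copy / reset / promotion cycle can be illustrated with a toy model. This is only a sketch under strong assumptions: liveness is passed in as a set instead of being discovered by tracing, spaces are unbounded lists, and the names (ToyGenerationalHeap, youngGc, tenuringThreshold) are hypothetical. It is not how HotSpot is implemented; real collectors also promote early when a survivor space overflows.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyGenerationalHeap {
    static final class Obj {
        final String id;
        int age; // number of young collections survived

        Obj(String id) {
            this.id = id;
        }
    }

    final List<Obj> eden = new ArrayList<>();
    List<Obj> survivorFrom = new ArrayList<>(); // the survivor space currently holding objects
    final List<Obj> old = new ArrayList<>();
    final int tenuringThreshold;

    ToyGenerationalHeap(int tenuringThreshold) {
        this.tenuringThreshold = tenuringThreshold;
    }

    Obj allocate(String id) {
        Obj o = new Obj(id);
        eden.add(o);
        return o;
    }

    /** Young collection: copy live objects to the empty survivor space, promote the old enough ones. */
    void youngGc(Set<Obj> live) {
        List<Obj> survivorTo = new ArrayList<>();
        List<Obj> candidates = new ArrayList<>(eden);
        candidates.addAll(survivorFrom);
        for (Obj o : candidates) {
            if (!live.contains(o)) {
                continue; // unreferenced objects are simply not copied
            }
            o.age++;
            if (o.age >= tenuringThreshold) {
                old.add(o); // promotion
            } else {
                survivorTo.add(o); // copy
            }
        }
        eden.clear();              // reset Eden
        survivorFrom = survivorTo; // the two survivor spaces swap roles
    }

    public static void main(String[] args) {
        ToyGenerationalHeap heap = new ToyGenerationalHeap(2);
        Obj a = heap.allocate("a");
        heap.allocate("b"); // never referenced again
        Set<Obj> live = new HashSet<>();
        live.add(a);
        heap.youngGc(live);
        heap.youngGc(live);
        System.out.println("old generation size: " + heap.old.size());
    }
}
```

Note that dead objects cost nothing in this scheme: only live objects are visited and copied, which is why a young collection is cheap when most objects die young.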
Garbage Collectors
44
• Generational
• Stop the world!
• Throughput or Concurrent
GC characteristics
45-48
Collector name for each Young / Old combination:
Old \ Young | Serial | Parallel
Serial | Serial | Parallel
Parallel | N/A | ParallelOld
Concurrent | CMS Serial | CMS
Slide 46 additionally labels the Serial row “Default”.
Parallel implementations actually differ for each variant
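To check which young and old collectors a given JVM is actually running, the standard GarbageCollectorMXBean interface can be queried. A minimal sketch follows; the class name ActiveCollectorsSketch is ours, and the bean names are implementation-specific (on HotSpot JVMs of that era they typically look like "ParNew" and "ConcurrentMarkSweep", or "PS Scavenge" and "PS MarkSweep").

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ActiveCollectorsSketch {
    /** Names of the collectors registered in this JVM, as reported by the platform MXBeans. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // The exact names vary by JVM vendor, version and command-line flags.
        System.out.println(collectorNames());
    }
}
```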
CMS is the right choice
50
[Bar chart: average test duration (s), axis 0-1000. Collectors: Serial, Parallel, ParallelOld, CMS, CMS Serial. Values as extracted: 937, 871, 846, 852, 917]
Tools: CLI
51
jps, jhat, jmap, jstack, jstat
$ jstat -gcutil PID
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  40.88  58.41  18.34  66.65   2729  316.538    46    6.820  323.358
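The jstat counters are cumulative, so a first rough reading is the average time per collection: YGCT / YGC for young collections and FGCT / FGC for full collections. A minimal sketch using the sample values from the slide (the class and method names are ours):

```java
public class JstatAveragesSketch {
    /** Average time per collection in seconds; 0 when no collection has happened yet. */
    static double averageSeconds(long count, double totalSeconds) {
        return count == 0 ? 0.0 : totalSeconds / count;
    }

    public static void main(String[] args) {
        // YGC / YGCT and FGC / FGCT copied from the jstat sample on the slide.
        double young = averageSeconds(2729, 316.538);
        double full = averageSeconds(46, 6.820);
        System.out.println("average young collection (s): " + young);
        System.out.println("average full collection (s): " + full);
    }
}
```

For this sample the averages come out at roughly 0.116 s per young collection and 0.148 s per full collection. Averages hide outliers, so they are a starting point rather than a diagnosis.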
Tools: GUIs
52
Tools: GUIs (2)
53
• Any profiler
• During development
• For autopsies!
-XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath
verbose:gc
54
[Annotated GC log excerpt]
Stop the world!
MBeans
55
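The GC MBeans expose cumulative collection counts and times, from which a "time spent in GC (%)" figure can be derived by sampling twice. This is a minimal sketch with the standard platform MXBeans; the class name, the method names and the one-second sampling interval are our own choices, and what exactly the time counter includes depends on the collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadSketch {
    /** Sum of the accumulated collection time over all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Percentage of elapsed wall-clock time spent in GC between two samples. */
    static double overheadPercent(long gcMillisBefore, long gcMillisAfter, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return 100.0 * (gcMillisAfter - gcMillisBefore) / elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long gcBefore = totalGcMillis();
        long start = System.currentTimeMillis();
        Thread.sleep(1000); // sampling interval, chosen arbitrarily for the sketch
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("GC overhead (%): " + overheadPercent(gcBefore, totalGcMillis(), elapsed));
    }
}
```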
OK, so we can measure... temperature!
56-57
58
Credit: http://www.lhup.edu/mkhalequ/fieldtrip/geos253.htm
But... a single temperature measurement is not enough to diagnose anything!
We must archive all measurements to know the baseline!
Therefore we must persist all measurements!
• JMX + jmxtrans
• RRD
• Graphite
• etc.
59
Operating the (many) switches only makes sense...
60
Credit: http://www.our-energy.com
...if we can measure/compare the effects!
61
[Chart: CPU time, before vs. after]
Putting it together
62
63
We want to minimize the GC pauses
• Young (ParNew)
• Old (CMS-initial-mark + CMS-remark)
64
JVM vs. Tomcat vs. Application (code)
1. Code
65
• Tuning the JVM cannot compensate for bad code
• Rules of thumb
  • Immutability = object reuse = fewer allocations*
  • Move invariant code out of tight loops
  • Know the characteristics of your data structures & frameworks (java.util, Guava, Hibernate, etc.)
  • Mind the gap: data structure overhead can kill you!
* But... pooling can be counter-productive!
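As an illustration of moving an invariant out of a tight loop, here is a minimal sketch. The example (counting numeric strings) and all names are ours, not from the talk. String.matches compiles a new Pattern on every call, so the first variant allocates on every iteration while the second reuses one compiled Pattern.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LoopInvariantSketch {
    /** Compiles the same regular expression again on every iteration. */
    static int countNumericSlow(List<String> values) {
        int n = 0;
        for (String v : values) {
            if (v.matches("\\d+")) {
                n++;
            }
        }
        return n;
    }

    // The invariant: compiled once and reused by every call.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Same result, without a Pattern compilation per element. */
    static int countNumericFast(List<String> values) {
        int n = 0;
        for (String v : values) {
            if (DIGITS.matcher(v).matches()) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("12", "ab", "7");
        System.out.println(countNumericSlow(values) + " " + countNumericFast(values));
    }
}
```

The second variant still allocates one Matcher per element, so it reduces allocation rather than eliminating it.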
Example: HashMap
66
[Diagram: HashMap -> Entry[16] -> Entry -> key, value]
• HashMap: 48 bytes
• Entry[16]: 80 bytes
• Entry: 32 bytes
Overhead = 160 bytes!
Alternatives:
• SingletonMap (40 bytes)
• initialCapacity + loadFactor
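The two alternatives named on the slide can be sketched as follows for a map that holds a single entry. The class and method names are ours; the byte figures on the slide depend on the JVM and pointer size and are not recomputed here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SmallMapSketch {
    /** Default HashMap: a table sized for 16 buckets even for a single entry. */
    static Map<String, String> defaultMap(String key, String value) {
        Map<String, String> map = new HashMap<>();
        map.put(key, value);
        return map;
    }

    /** Immutable single-entry map: no table and no separate entry object to size. */
    static Map<String, String> singleton(String key, String value) {
        return Collections.singletonMap(key, value);
    }

    /** HashMap created with an explicit initialCapacity and loadFactor. */
    static Map<String, String> presized(String key, String value) {
        Map<String, String> map = new HashMap<>(1, 1.0f);
        map.put(key, value);
        return map;
    }

    public static void main(String[] args) {
        System.out.println(defaultMap("k", "v").equals(singleton("k", "v")));
        System.out.println(defaultMap("k", "v").equals(presized("k", "v")));
    }
}
```

The trade-off is that Collections.singletonMap returns an immutable map: calling put on it throws UnsupportedOperationException, so it only fits data that never changes after creation.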
Fewer allocations...
67
[Chart: young GCs per second]
... saves CPU
68
[Chart: CPU load]
2. Tomcat
69
• Pooling
  • JSP tags: enablePooling in web/webdefault.xml
  • -Dorg.apache.jasper.runtime.JspFactoryImpl.USE_POOL=false
• Careful with buffers and their reuse
  • -Dorg.apache.jasper.runtime.BodyContentImpl.LIMIT_BUFFER=true
• JSP usage is a factor in PermGen requirements
• Test & Measure, always!
Pooling may lead to Old fragmentation!
3. The JVM
70
[Charts: heap size over time, two configurations. Annotations: “Frequent GC”, “pause > 1s!”]
The heap
71
Heap
• -Xms: start size
• -Xmx: max size
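The effect of -Xms and -Xmx can be checked from inside the application with the standard Runtime API. A minimal sketch (the class name and the toMb helper are ours); the reported maximum is usually close to, but not exactly, the -Xmx value.

```java
public class HeapLimitsSketch {
    /** Converts a byte count to whole megabytes. */
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most the JVM will attempt to use (roughly -Xmx).
        System.out.println("max heap: " + toMb(rt.maxMemory()) + " MB");
        // totalMemory(): currently committed heap (starts near -Xms and may grow).
        System.out.println("committed heap: " + toMb(rt.totalMemory()) + " MB");
        // freeMemory(): unused part of the committed heap.
        System.out.println("free in committed heap: " + toMb(rt.freeMemory()) + " MB");
    }
}
```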
Young vs Old
72
Sizing: -XX:NewSize, -XX:MaxNewSize, -XX:NewRatio
Young
• Objects < RequestScope
Old
• “Working Set”
• Caches, object pools
• HttpSession, average lifespan objects
First mistake: setting Young too small
73
• Young fills up quickly = many Young GCs
• Objects promoted to Tenured too fast = many Old GCs
Second mistake: setting Young too large
74
• Young GC pauses increase
Tuning Young
75
Default NewRatio=8 with -server on Intel = too small for a webapp with non-trivial load!
Increase Young slowly and measure the effects!
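NewRatio is the ratio Old / Young, so the young generation gets roughly heap / (NewRatio + 1). A minimal worked example follows; the 2048 MB heap is a hypothetical figure chosen for illustration, and the real JVM aligns and rounds generation sizes, so the results are approximations.

```java
public class NewRatioSketch {
    /** Approximate young generation size implied by -XX:NewRatio, where NewRatio = Old / Young. */
    static long youngMb(long heapMb, int newRatio) {
        return heapMb / (newRatio + 1);
    }

    public static void main(String[] args) {
        long heapMb = 2048; // hypothetical heap size, not a figure from the talk
        System.out.println("NewRatio=8 -> young ~ " + youngMb(heapMb, 8) + " MB");
        System.out.println("NewRatio=2 -> young ~ " + youngMb(heapMb, 2) + " MB");
    }
}
```

With these assumptions, NewRatio=8 leaves about 227 MB for Young out of 2048 MB, while NewRatio=2 gives about 682 MB.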
Old: Mind the Gaps (fragmentation)!
76
JDK6 < u22:
ParNew (promotion failure size = 15595) (promotion failed)
Old generation: ideal shape
77
Old generation: real life
78
Old generation: ideal vs. real
79
Rate increases
Things to watch for
80
• Traffic/Load variance
  • Traffic increases => memory pressure increases
• CMS requires some headroom to operate properly
  • Several phases are concurrent, i.e. they run while new objects are being allocated
(concurrent mode failure): 2165740K->1284261K(2228224K), 8.9411250 secs
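Log fragments like the concurrent mode failure line can be mined for numbers. The following is a minimal sketch that only handles the "before->after(capacity), seconds" shape seen on the slide; it is not a general GC log parser, and the class and method names are ours.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogFragmentSketch {
    // Matches "<before>K-><after>K(<capacity>K), <seconds> secs".
    private static final Pattern SIZES =
            Pattern.compile("(\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs");

    /** Returns {beforeKb, afterKb, capacityKb, pauseSeconds}, or null if the line does not match. */
    static double[] parse(String line) {
        Matcher m = SIZES.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2)),
            Double.parseDouble(m.group(3)),
            Double.parseDouble(m.group(4))
        };
    }

    public static void main(String[] args) {
        double[] r = parse("(concurrent mode failure): 2165740K->1284261K(2228224K), 8.9411250 secs");
        double occupancyBefore = 100.0 * r[0] / r[2];
        System.out.println("occupancy before (%): " + occupancyBefore);
        System.out.println("pause (s): " + r[3]);
    }
}
```

For the line on the slide, the occupancy before the collection works out to about 97% of capacity, which shows how little headroom was left when the concurrent collection failed.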
Giving CMS some room to operate
81
CMSInitiatingOccupancyFraction = 92%
This is the default...
We really need 75-80%
UseCMSInitiatingOccupancyOnly to force the JVM to only consider this criterion
82
CMS initial-mark
83
CMS initial-mark (cumulative)
Median: -83%
Top 99%: -79%
84
CMS remark
85
CMS remark (cumulative)
Top 90%: -56%
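Figures such as "Median: -83%" compare a percentile of the pause distribution before and after tuning. A minimal sketch of that computation follows, using the nearest-rank definition of a percentile; the talk does not say which definition was used, and the sample values in main are hypothetical, not measurements from the talk.

```java
import java.util.Arrays;

public class PausePercentilesSketch {
    /** Nearest-rank percentile of a non-empty sample array, for p in (0, 100]. */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Relative change in percent; negative when the value after tuning is smaller. */
    static double changePercent(double before, double after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // Hypothetical pause samples in milliseconds.
        double[] before = {100, 120, 90, 400, 110};
        double[] after = {20, 25, 15, 90, 18};
        double medianChange = changePercent(percentile(before, 50), percentile(after, 50));
        System.out.println("median change (%): " + medianChange);
    }
}
```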
But... I still see pauses!
86
• RMI triggers explicit GC regularly
  • Invokes System.gc()
  • Explicit GC = Full GC (Serial) = 4-8s stop-the-world pause!
• DisableExplicitGC + CMSClassUnloadingEnabled
• ExplicitGCInvokesConcurrentAndUnloadsClasses
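Whether a running JVM was started with one of these options can be checked through the standard RuntimeMXBean. This is a minimal sketch: the class and method names are ours, and it only detects flags passed literally on the command line in the -XX:+Name form.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class ExplicitGcFlagsSketch {
    /** True if the given flag appears verbatim among the JVM input arguments. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String[] flags = {
            "-XX:+DisableExplicitGC",
            "-XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses"
        };
        for (String flag : flags) {
            System.out.println(flag + ": " + hasFlag(jvmArgs, flag));
        }
    }
}
```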
Complete GC comparison
87
What’s next?
89
• Survivors tuning (S0 & S1)
  • Size, ratio vs. Eden, max generation
• G1
  • Principles and operations are radically different!
• Other JVMs: JRockit, Azul, IBM
• Check tuning validity after every code change!
• Measure, measure, measure!
90
Questions?