Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

HotSpot AOT Internals and performance results

Vladimir Kozlov, Igor Veresov

HotSpot Compiler Team, Oracle
October 5, 2017


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


AOT motivations

• Needed for the longer-term strategy of supporting a future Java-based JIT compiler.
• Provide faster startup for applications, since hot methods and class initializers will be readily available. Expected to be important for Cloud.
• Provide quicker time to peak performance. Statically generated code could be the first pass in a multi-tiered compilation system.
• Density improvement: sharing AOT'd code (application dependent).
• “Prevent global warming”. AOT uses much less CPU power by running compiled code instead of interpreting from the start. Non-tiered (static) AOT also excludes JIT compilation and profiling for the corresponding Java methods.


AOT functionality overview

The new JDK tool jaotc is used for AOT compilation. It uses Graal-core as the code-generating backend. In JDK 9, Libelf was used to produce AOT shared libraries in ELF format. In JDK 10 we removed the dependency on Libelf and added support for macOS and Windows.

To use jaotc, the user has to specify a list of .class files, .jar files or Java module names as input and the resulting AOT library name as output (unnamed.so is used if no name is specified):

    jaotc --output libHelloWorld.so HelloWorld.class
    jaotc --output libjava.base.and.javac.so --module java.base:jdk.compiler

The user can specify which methods to compile or exclude with the --compile-commands flag:

    jaotc --output libjava.base.so --compile-commands base.txt --module java.base

The command file can contain two commands: exclude or compileOnly.
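To make the effect of a command file more concrete, here is a minimal Java sketch of how exclude and compileOnly lines could be applied to a list of method names. It is not jaotc's implementation. The class CompileCommandFilter, its methods, the choice to treat each pattern as a Java regular expression matched against the fully qualified method name, and the rule that exclude wins over compileOnly are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of jaotc.
final class CompileCommandFilter {
    private final List<Pattern> excluded = new ArrayList<>();
    private final List<Pattern> compileOnly = new ArrayList<>();

    // Parses lines of the form "<command> <pattern>", e.g. "exclude sun.security.ssl.*".
    static CompileCommandFilter parse(List<String> lines) {
        CompileCommandFilter filter = new CompileCommandFilter();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed command: " + line);
            }
            // Assumption: the pattern is an ordinary Java regular expression.
            Pattern pattern = Pattern.compile(parts[1]);
            if (parts[0].equals("exclude")) {
                filter.excluded.add(pattern);
            } else if (parts[0].equals("compileOnly")) {
                filter.compileOnly.add(pattern);
            } else {
                throw new IllegalArgumentException("Unknown command: " + parts[0]);
            }
        }
        return filter;
    }

    // Assumed semantics: exclude always wins; compileOnly, when present, acts as a whitelist.
    boolean shouldCompile(String qualifiedMethodName) {
        for (Pattern p : excluded) {
            if (p.matcher(qualifiedMethodName).matches()) {
                return false;
            }
        }
        if (compileOnly.isEmpty()) {
            return true;
        }
        for (Pattern p : compileOnly) {
            if (p.matcher(qualifiedMethodName).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        CompileCommandFilter filter = parse(Arrays.asList("exclude sun.security.ssl.*"));
        System.out.println(filter.shouldCompile("sun.security.ssl.Handshaker.process")); // false
        System.out.println(filter.shouldCompile("java.lang.String.length"));             // true
    }
}
```

The real tool may interpret patterns differently, so this only shows the general shape of the filtering step.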


AOT functionality overview

AOT code can be compiled in two modes, controlled by the --compile-for-tiered flag (currently off by default):

• Non-tiered AOT-compiled code behaves similarly to statically compiled C++ code (or C1 JIT-compiled code in the Client VM): no profiling information is collected, and no JIT recompilation happens unless the AOT code is deoptimized.

• Tiered AOT-compiled code does collect profiling information. The profiling is the same simple profiling (invocation and back-branch counters) done by C1 methods compiled at Tier 2. If AOT methods hit the AOT invocation thresholds, they are first recompiled by C1 at Tier 3 to gather full profiling information. C2 JIT recompilation needs this information to produce optimal code and reach peak application performance.


AOT functionality overview

Currently, the same JVM version and runtime configuration must be used during AOT compilation and execution:

    jaotc -J-XX:+UseParallelGC -J-XX:-UseCompressedOops --output libHelloWorld.so HelloWorld.class
    java -XX:+UseParallelGC -XX:-UseCompressedOops -XX:AOTLibrary=./libHelloWorld.so HelloWorld

The version and configuration are recorded in the AOT library and verified during execution. If verification fails, the AOT library is not used and the JVM continues to run, or exits if the diagnostic flag -XX:+UseAOTStrictLoading is specified.

AOT recompilation is required when Java is updated.
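The record-and-verify behaviour described above can be sketched in a few lines of Java. This is only an illustration of the decision logic. The class AotConfigCheck, the Outcome enum and the idea of representing the recorded configuration as a map of flag names to values are assumptions, not the JVM's actual data structures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the load-time check; not HotSpot code.
final class AotConfigCheck {
    enum Outcome { USE_LIBRARY, IGNORE_LIBRARY, EXIT_VM }

    // recorded: configuration stored in the AOT library at compile time.
    // current:  configuration of the running JVM.
    // strictLoading models the diagnostic flag -XX:+UseAOTStrictLoading.
    static Outcome verify(Map<String, String> recorded,
                          Map<String, String> current,
                          boolean strictLoading) {
        if (recorded.equals(current)) {
            return Outcome.USE_LIBRARY;
        }
        // On mismatch the library is skipped, or the VM exits in strict mode.
        return strictLoading ? Outcome.EXIT_VM : Outcome.IGNORE_LIBRARY;
    }

    public static void main(String[] args) {
        Map<String, String> recorded = new HashMap<>();
        recorded.put("UseParallelGC", "true");
        recorded.put("UseCompressedOops", "false");

        Map<String, String> current = new HashMap<>(recorded);
        System.out.println(verify(recorded, current, false)); // USE_LIBRARY

        current.put("UseCompressedOops", "true");
        System.out.println(verify(recorded, current, false)); // IGNORE_LIBRARY
        System.out.println(verify(recorded, current, true));  // EXIT_VM
    }
}
```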


AOT functionality overview

The AOT tool jaotc does not resolve referenced classes that are neither system classes nor part of the compiled classes. Such referenced classes have to be added to the class path:

    jaotc --output=libfoo.so --jar foo.jar -J-cp -J./

or additional Java modules have to be specified:

    jaotc --output=libactivation.so --module java.activation -J--add-modules=java.se.ee

Otherwise a ClassNotFoundException could be thrown during AOT compilation.


AOT functionality overview

During JVM startup, the AOT initialization code looks for well-known AOT libraries in a well-known location ($JAVA_HOME/lib) or for libraries specified by the -XX:AOTLibrary JVM flag. If shared libraries are found, they are loaded and used. If no shared libraries can be found, AOT is turned off.

    java -XX:AOTLibrary=./libHelloWorld.so,./libjava.base.so HelloWorld

The JVM knows the AOT library names for the following Java modules:

    java.base, jdk.compiler (javac), jdk.scripting.nashorn, jdk.internal.vm.ci (JVMCI), jdk.internal.vm.compiler (Graal)

Note: users must compile and install the well-known AOT libraries themselves.


AOT functionality overview

A set of AOT libraries can be generated for different execution environments. The JVM knows the following well-known names for AOT libraries generated for a specific runtime configuration. It looks for them in the $JAVA_HOME/lib directory and loads the one that corresponds to the current runtime configuration:

    -XX:-UseCompressedOops -XX:+UseG1GC        : libjava.base.so
    -XX:+UseCompressedOops -XX:+UseG1GC        : libjava.base-coop.so
    -XX:-UseCompressedOops -XX:+UseParallelGC  : libjava.base-nong1.so
    -XX:+UseCompressedOops -XX:+UseParallelGC  : libjava.base-coop-nong1.so
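The four names above follow a simple suffix pattern, which the Java sketch below reproduces. The class WellKnownAotLibrary and the method nameFor are hypothetical. The slide lists names only for java.base, so applying the same pattern to another module is an assumption.

```java
// Hypothetical helper that reproduces the naming pattern listed above.
final class WellKnownAotLibrary {
    // "-coop" is appended when compressed oops are on,
    // "-nong1" when the collector is not G1 (Parallel GC in the list above).
    static String nameFor(String module, boolean useCompressedOops, boolean useG1GC) {
        StringBuilder name = new StringBuilder("lib").append(module);
        if (useCompressedOops) {
            name.append("-coop");
        }
        if (!useG1GC) {
            name.append("-nong1");
        }
        return name.append(".so").toString();
    }

    public static void main(String[] args) {
        System.out.println(nameFor("java.base", false, true));  // libjava.base.so
        System.out.println(nameFor("java.base", true, true));   // libjava.base-coop.so
        System.out.println(nameFor("java.base", false, false)); // libjava.base-nong1.so
        System.out.println(nameFor("java.base", true, false));  // libjava.base-coop-nong1.so
    }
}
```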


AOT functionality overview

• Code sections in an AOT library are treated by the JVM as an extension of the existing CodeCache. When a Java class is loaded, the JVM checks whether corresponding AOT-compiled methods exist in the loaded AOT libraries and adds links to them from the Java method descriptors.

• AOT-compiled code follows the same invocation/deoptimization/unloading rules as normal JIT-compiled code.

• To detect changes in classes, AOT uses class fingerprinting. During AOT compilation, a fingerprint for each class is generated and stored in the AOT library. During execution, when a class is loaded and AOT-compiled code is found for it, the fingerprint of the loaded class is compared to the one stored in the AOT library. If they do not match, the AOT code for that particular class is not used (the AOT code is marked non-entrant).
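The fingerprint comparison described above can be pictured with a small Java sketch. The slides do not say how HotSpot computes the fingerprint, so CRC32 over the class bytes is used here purely as a stand-in. The class FingerprintCheck and its method names are likewise hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical model of fingerprint-based validation; not HotSpot code.
final class FingerprintCheck {
    // Fingerprints recorded at AOT compilation time, keyed by class name.
    private final Map<String, Long> stored = new HashMap<>();

    // Stand-in fingerprint: CRC32 of the class file bytes (an assumption).
    static long fingerprint(byte[] classBytes) {
        CRC32 crc = new CRC32();
        crc.update(classBytes, 0, classBytes.length);
        return crc.getValue();
    }

    void recordAtCompileTime(String className, byte[] classBytes) {
        stored.put(className, fingerprint(classBytes));
    }

    // Returns true if AOT code for the class may be used with the loaded bytes.
    boolean aotCodeUsable(String className, byte[] loadedBytes) {
        Long expected = stored.get(className);
        return expected != null && expected.longValue() == fingerprint(loadedBytes);
    }

    public static void main(String[] args) {
        FingerprintCheck check = new FingerprintCheck();
        byte[] original = {1, 2, 3};
        byte[] modified = {1, 2, 4};
        check.recordAtCompileTime("HelloWorld", original);
        System.out.println(check.aotCodeUsable("HelloWorld", original)); // true
        System.out.println(check.aotCodeUsable("HelloWorld", modified)); // false
    }
}
```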


Graal changes

• Indirect load of constants, including constant replacement
• Class initialization
• Profiling (tiered compilation support)
• Inlining


Constants

• Constants embedded in the code (field offsets, GC barriers, etc.): need to be validated before any code is run.
• Global constants (heap top/end, card table base, etc.): eagerly initialized when the library is loaded. GraalHotSpotVMConfigNode represents these values. It folds to constants in JIT mode and becomes indirect loads in AOT mode.
• Local constants (classes, objects, method counters): lazily initialized at runtime.


Constant replacement

• Automatic: constants are replaced by nodes that provide indirection and handle lazy resolution if necessary (ReplaceConstantNodesPhase).
• Replaces classes, method counters and string constants with ResolveConstantNode and ResolveMethodAndLoadCountersNode.
• Some well-known class constants are eagerly resolved and replaced with LoadConstantIndirectlyNode (for example, primitive array classes).
• Class mirror constants are replaced with indirections through class constants (LoadJavaMirrorWithKlassPhase).


Constant replacement optimizations

• Currently a single resolution for each constant (placed in a dominating block).
• Reuse of dominating class initialization nodes (InitializeKlassNode).
• Resolution of the root method holder class and its superclasses can be omitted, since they are guaranteed to be initialized (and hence resolved). Replaced by LoadConstantIndirectlyNode.


Constant replacement

Later, all these nodes are lowered into snippets that do the following (shown here for class constants):

    KlassPointer result = LoadConstantIndirectlyNode.loadKlass(constant);
    if (probability(VERY_SLOW_PATH_PROBABILITY, result.isNull())) {
        result = ResolveConstantStubCall.resolveKlass(constant, EncodedSymbolNode.encode(constant));
    }
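The load-then-resolve-on-null pattern in the snippet above can be modelled in plain Java. This is a sketch of the idea only, not Graal or HotSpot code. LazyConstantTable, its cells array and the resolver callback are hypothetical stand-ins for the indirection cells and the resolution stub call, and the resolver is assumed never to return null.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Function;

// Hypothetical model of lazily resolved constants behind an indirection table.
final class LazyConstantTable {
    private final String[] symbols;                  // names stored at compile time
    private final AtomicReferenceArray<Object> cells; // null means "not yet resolved"
    private final Function<String, Object> resolver;  // stand-in for the runtime stub

    LazyConstantTable(String[] symbols, Function<String, Object> resolver) {
        this.symbols = symbols.clone();
        this.cells = new AtomicReferenceArray<>(symbols.length);
        this.resolver = resolver;
    }

    Object load(int index) {
        Object result = cells.get(index);      // fast path: one indirect load
        if (result == null) {                  // slow path: taken on first use
            Object resolved = resolver.apply(symbols[index]);
            // Keep the first published value if another thread raced us here.
            cells.compareAndSet(index, null, resolved);
            result = cells.get(index);
        }
        return result;
    }

    public static void main(String[] args) {
        LazyConstantTable table = new LazyConstantTable(
                new String[] {"java/lang/String"},
                name -> name.replace('/', '.'));
        System.out.println(table.load(0));                  // java.lang.String
        System.out.println(table.load(0) == table.load(0)); // true: resolved once, then reused
    }
}
```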


Class initialization

• InitializeKlassNode is inserted at every initialization point through the new parser plugin interface ClassInitializationPlugin (see HotSpotClassInitializationPlugin for the implementation).
• Some optimizations are possible in the parsing phase (type >=: holder, except for interfaces).
• A separate phase (EliminateRedundantInitializationPhase) uses data flow analysis to remove redundant class initialization nodes. It works well after loop peeling.
• Lowered into an if-null-then-call-runtime diamond.


Tiered support

• Similar to Tier 2 profiling, but with higher thresholds.
• Counts invocations and back branches, calling back into the runtime to re-JIT.
• Profiling nodes are inserted in the parser via the new plugin interface ProfilingPlugin (see HotSpotProfilingPlugin for the implementation).
• They are later processed in FinalizeProfileNodesPhase, which assigns inlinee notification frequencies and random sources (more about this later).
• Profiling nodes can be merged if they profile the same thing and are in straight-line control flow (this works well with loop unrolling).


Kinds of profiling

• Profiling nodes are lowered in two ways: normal profiling and probabilistic profiling.
• Normal profiling is what you’d expect (see ProfileSnippets).
• Probabilistic profiling tries to minimize cache line ping-ponging (see ProbabilisticProfileSnippets).


Probabilistic profiling

• Threads executing the same methods compete for the same cache line, where the counters are.
• The idea is not to increment every time, but to do it with some predefined probability.
• Branch on random: if ((random() & ((1 << prob_log) - 1)) == 0) { ... }


Probabilistic profiling

    if ((random() & ((1 << probLog) - 1)) == 0) { // branch on random
        // Add the increment scaled by 2^probLog to compensate for the skipped updates.
        int counterValue = counters.readInt(invocationCounterOffset) + (invocationCounterIncrement << probLog);
        counters.writeInt(invocationCounterOffset, counterValue);
        if (freqLog >= 0) { // folds
            int probabilityMask = (1 << probLog) - 1;
            int frequencyMask = (1 << freqLog) - 1;
            int mask = frequencyMask & ~probabilityMask;
            if ((counterValue & (mask << invocationCounterShift)) == 0) {
                methodInvocationEvent(counters);
            }
        }
    }
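To see why the scaled increment keeps the counter meaningful, here is a self-contained Java sketch of the same idea outside the compiler. It is not the Graal snippet. ProbabilisticCounter and its fields are hypothetical, and java.util.Random stands in for the cheap random source. The counter is updated with probability 2^-probLog and each update adds 2^probLog, so its expected value equals the true event count while the shared memory location is written far less often.

```java
import java.util.Random;
import java.util.function.IntSupplier;

// Hypothetical stand-alone model of a probabilistically updated counter.
final class ProbabilisticCounter {
    private final int probLog;        // update probability is 2^-probLog
    private final IntSupplier random; // source of random ints
    private long value;

    ProbabilisticCounter(int probLog, IntSupplier random) {
        this.probLog = probLog;
        this.random = random;
    }

    void increment() {
        // Branch on random: the low probLog bits are all zero with probability 2^-probLog.
        if ((random.getAsInt() & ((1 << probLog) - 1)) == 0) {
            // Compensate for the skipped updates by adding 2^probLog at once.
            value += 1L << probLog;
        }
    }

    long value() {
        return value;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        ProbabilisticCounter counter = new ProbabilisticCounter(3, () -> rng.nextInt());
        int events = 1_000_000;
        for (int i = 0; i < events; i++) {
            counter.increment();
        }
        // The estimate is a multiple of 8 and should be close to the true count.
        System.out.println("true count = " + events + ", estimate = " + counter.value());
    }
}
```

The price is some statistical noise in the counter, which is acceptable when the counter only has to trigger recompilation thresholds.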


Probabilistic profiling

• Start with a cheap random source.
• RandomSeedNode, which lowers to rdtsc on x64.
• Inside loops that is too expensive, so linear congruential generators are injected into loops:

    X_{n+1} = a * X_n + c

• Injection is also done in FinalizeProfileNodesPhase. Lowered with ProbabilisticProfileSnippets.
• Almost anything is better than a cache miss!
• Up to 30% speedup on NUMA machines.
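A linear congruential generator of the form X_{n+1} = a * X_n + c is cheap enough to run on every loop iteration. The Java sketch below shows one driving the branch-on-random test inside a loop. The class LoopSampler and the constants a = 1664525 and c = 1013904223 (a commonly published 32-bit parameter pair) are assumptions for illustration. The slides do not say which constants Graal uses.

```java
// Hypothetical loop sampler driven by a 32-bit linear congruential generator.
final class LoopSampler {
    private static final int A = 1664525;    // illustrative multiplier (assumption)
    private static final int C = 1013904223; // illustrative increment (assumption)
    private int state;

    LoopSampler(int seed) {
        this.state = seed;
    }

    // X_{n+1} = a * X_n + c, computed modulo 2^32 through int overflow.
    int next() {
        state = A * state + C;
        return state;
    }

    // True when the low probLog bits of the next value are all zero.
    boolean shouldProfile(int probLog) {
        return (next() & ((1 << probLog) - 1)) == 0;
    }

    // Counts how many of the given loop iterations would update the profile.
    static int countSamples(int seed, int iterations, int probLog) {
        LoopSampler sampler = new LoopSampler(seed);
        int samples = 0;
        for (int i = 0; i < iterations; i++) {
            if (sampler.shouldProfile(probLog)) {
                samples++;
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        System.out.println(countSamples(12345, 800, 3)); // prints 100
    }
}
```

With these particular constants the low three bits of the state cycle with period 8, so this sketch samples exactly one iteration in eight. Testing higher-order bits instead would make the pattern less regular.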


Inlining

• Special inlining policy. There is no profiling information, so less inlining depth is allowed.
• CHA (class hierarchy analysis) dependency validation is not supported, so some CHA assumptions (single leaf) are converted to runtime checks.
• Otherwise inlining is only done for exact types.
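Converting a single-leaf assumption into a runtime check can be pictured at the source level. The Java sketch below is a hand-written illustration of the shape of such guarded code, not output from Graal. The classes Shape, Triangle and Square and the method sidesGuarded are hypothetical.

```java
// Hand-written illustration of a type guard replacing a CHA assumption.
final class GuardedInlining {
    abstract static class Shape {
        abstract int sides();
    }

    static final class Triangle extends Shape {
        @Override
        int sides() {
            return 3;
        }
    }

    // Imagine this class is loaded only after the AOT code was generated.
    static final class Square extends Shape {
        @Override
        int sides() {
            return 4;
        }
    }

    // Shape of the code when Triangle was the only leaf seen at compile time:
    // the inlined body is protected by an exact-type check, with a fallback call.
    static int sidesGuarded(Shape shape) {
        if (shape.getClass() == Triangle.class) {
            return 3;         // inlined body of Triangle.sides()
        }
        return shape.sides(); // slow path: ordinary virtual call
    }

    public static void main(String[] args) {
        System.out.println(sidesGuarded(new Triangle())); // 3
        System.out.println(sidesGuarded(new Square()));   // 4
    }
}
```

Because the guard is checked on every call, the code stays correct when a new subclass appears, without needing a dependency that would invalidate the compiled method.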


AOT compilation phases

• Parse flags.
• Create the list of Java methods to compile:
  – Collect Java classes to compile based on the flags that specify classes, modules and .jar files: --class-name, --module, --jar, --directory
  – Filter Java methods from the collected classes based on commands specified in the --compile-commands file. Regular expressions are used to specify methods:
        exclude sun.security.ssl.*
• Use Graal to compile the collected Java methods:
  – Create up to 16 Graal compiler threads, depending on CPU count
  – Add methods to the compilation queue and wait until all are compiled
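The last step above, a bounded pool of compiler threads draining a queue of methods, can be sketched with the standard java.util.concurrent API. This is not jaotc's code. CompileQueue, the placeholder "compiled:" result and the exact rule min(16, CPU count) are assumptions. The slides only say that up to 16 threads are created depending on CPU count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model of queueing methods for compilation on a bounded thread pool.
final class CompileQueue {
    // Assumed rule: one thread per CPU, at least 1 and at most 16.
    static int threadCount(int cpus) {
        return Math.max(1, Math.min(16, cpus));
    }

    // Submits every method, then waits until all are "compiled".
    static List<String> compileAll(List<String> methods) {
        int threads = threadCount(Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (final String method : methods) {
                // Placeholder for a real compilation request.
                Callable<String> task = () -> "compiled:" + method;
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // blocks until this method is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while compiling", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Compilation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(threadCount(64)); // 16
        System.out.println(compileAll(Arrays.asList("A.foo", "B.bar"))); // [compiled:A.foo, compiled:B.bar]
    }
}
```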


AOT compilation phases

• Parse compiled code:
  – record the locations of calls in compiled code
  – copy compiled code into a code buffer and write call stubs after the code
  – replace call destinations with calls to call stubs (trampolines), which load the destination address from the RW section and jump there
• Process compiled code metadata:
  – record the locations of calls, constants and oop (String) references
  – allocate memory for class and method pointers (GOT cells) referenced in code; for tiered AOT, also allocate memory for method counter pointers
  – store the names of classes, methods and strings in UTF-8 format for use during execution for resolution and allocation
• Call into the runtime through JVMCI to get the class fingerprint and compiled metadata.


AOT compilation phases

• Create the AOT header section, which contains general information about the compilation, the runtime configuration and the Java version.
• Create the object file:
  – convert buffers with code and collected data into ELF sections
  – record referenced symbols
  – record relocation data for the loader
• Generate the shared AOT library using the linker. For example, for Linux-x64:
        ld -shared -z noexecstack -o lib.so lib.o


AOT library sections

The ELF sections available in an AOT library can be listed with readelf (Linux):

    readelf -S -W libHelloWorld.so
    [ 5] .text             PROGBITS    0000000000002380 002380 005a80 00  AX  0   0 128
    [ 6] .metaspace.names  PROGBITS    0000000000007e00 007e00 0007ea 00   A  0   0 128
    [ 7] .klasses.offsets  PROGBITS    0000000000008600 008600 000150 00   A  0   0 128
    [ 8] .methods.offsets  PROGBITS    0000000000008780 008780 000034 00   A  0   0 128

    readelf -p .header libHelloWorld.so
    String dump of section '.header':
      [     8]  )
      [    1d]  09-internal+0-2017-07-06-132808.uid.070617hs


AOT examples

• Manual test scripts to build AOT libraries and test AOT are located in the HotSpot test directory. They can be used as AOT examples. The README explains how to run the tests:

    hotspot/test/compiler/aot/scripts/README


Startup performance with AOT and CDS

• JDK 10 was used: latest jdk10/hs plus additional AOT and CDS changes.
• A Skylake machine provided by Intel was used for testing.
• G1 GC was the default GC.
• Java heap size was 4 GB. Compressed oops were enabled.
• The number of GC and compiler threads was limited to 4 each.
• The application run was bound to one CPU node: numactl --cpubind=1 --membind=1.
• Only touched methods were AOT compiled. A training run was done to get the list of used classes for CDS and the list of touched methods for AOT.
• One AOT library was built, combining java.base and application code.
• This is work in progress: “your mileage may vary”.


‘real’ time (ms) to run `java HelloWorld`

[Bar chart; smaller value is better. Y axis: 0 to 80 ms. Groups: default, AOT, CDS, AOT+CDS. Series: C1 (TieredStopAtLevel=1), Tiered C1 + C2, Tiered C1 + Graal, Tiered C1 + Graal (2).]

Configurations shown:
• java.base static AOT for C1 case
• java.base tiered AOT for C1+C2
• java.base+graal static AOT for C1+Graal
• java.base+graal tiered AOT for C1+Graal (2)
• java.base CDS
• java.base+graal CDS for C1+Graal (2)

‘user’ time (ms) to run `java HelloWorld`

[Bar chart; smaller value is better. Y axis: 0 to 80 ms. Groups: default, AOT, CDS, AOT+CDS. Series: C1 (TieredStopAtLevel=1), Tiered C1 + C2, Tiered C1 + Graal, Tiered C1 + Graal (2).]

Configurations shown:
• java.base static AOT for C1 case
• java.base tiered AOT for C1+C2
• java.base+graal static AOT for C1+Graal
• java.base+graal tiered AOT for C1+Graal (2)
• java.base CDS
• java.base+graal CDS for C1+Graal (2)

‘real’ time (ms) for javac to compile 182 JVMCI files

[Bar chart; smaller value is better. Y axis: 0 to 2200 ms. Groups: default, AOT, CDS, AOT+CDS. Series: C1 (TieredStopAtLevel=1), Tiered C1 + C2, Tiered C1 + Graal, Tiered C1 + Graal (2). Bar labels: A, B, C, D, E, E, E, F, A+E, B+E, C+E, D+F.]

Configurations shown:
A) java.base+javac static AOT
B) java.base+javac tiered AOT
C) java.base+javac+graal static AOT
D) java.base+javac+graal tiered AOT
E) java.base+javac CDS
F) java.base+javac+graal CDS

‘user’ time (ms) for javac to compile 182 JVMCI files

[Bar chart; smaller value is better. Y axis: 0 to 10000 ms. Groups: default, AOT, CDS, AOT+CDS. Series: C1 (TieredStopAtLevel=1), Tiered C1 + C2, Tiered C1 + Graal, Tiered C1 + Graal (2). Bar labels: A, B, C, D, E, E, E, F, A+E, B+E, C+E, D+F.]

Configurations shown:
A) java.base+javac static AOT
B) java.base+javac tiered AOT
C) java.base+javac+graal static AOT
D) java.base+javac+graal tiered AOT
E) java.base+javac CDS
F) java.base+javac+graal CDS

‘real’ time (ms) for jruby -e 'puts "Hello World"'

[Bar chart; smaller value is better. Y axis: 0 to 2000 ms. Groups: default, AOT, CDS, AOT+CDS. Series: C1 (TieredStopAtLevel=1), Tiered C1 + C2, Tiered C1 + Graal, Tiered C1 + Graal (2). Bar labels: A, B, C, D, E, E, E, A+E, B+E, C+E, D+E.]

Configurations shown:
A) java.base+jruby static AOT
B) java.base+jruby tiered AOT
C) java.base+jruby+graal static AOT
D) java.base+jruby+graal tiered AOT
E) java.base+jruby CDS

‘user’ time (ms) for jruby -e 'puts "Hello World"'

[Bar chart; smaller value is better. Y axis: 0 to 8000 ms. Groups: default, AOT, CDS, AOT+CDS. Series: C1 (TieredStopAtLevel=1), Tiered C1 + C2, Tiered C1 + Graal, Tiered C1 + Graal (2). Bar labels: A, B, C, D, E, E, E, A+E, B+E, C+E, D+E.]

Configurations shown:
A) java.base+jruby static AOT
B) java.base+jruby tiered AOT
C) java.base+jruby+graal static AOT
D) java.base+jruby+graal tiered AOT
E) java.base+jruby CDS

SPECjvm2008 performance with AOT and CDS

• JDK 10 was used: latest jdk10/hs plus additional AOT and CDS changes.
• 4 threads (-bt 4), 3 iterations (-i 3), 10 sec warmup and 5 sec per iteration (-wt 10 -it 5).
• The combined score was used for each run (all benchmarks were executed except the startup benchmarks).
• Parallel GC was the default GC. Java heap size was 4 GB. Compressed oops were enabled.
• The number of GC and compiler threads was limited to 4 each.
• The application run was bound to one CPU node: numactl --cpubind=1 --membind=1.
• Only touched methods were AOT compiled. A training run was done to get the list of used classes for CDS and the list of touched methods for AOT.


Composite score (ops/m) for jvm2008

[Bar chart; higher value is better. Y axis: 0 to 320 ops/m. Groups: default, AOT, CDS, AOT+CDS. Series: C1 (TieredStopAtLevel=1), Tiered C1 + C2, Tiered C1 + Graal, Tiered C1 + Graal (2). Bar labels: A, B, C, D, E, E, E, F, A+E, B+E, C+E, D+F.]

Configurations shown:
A) java.base static AOT
B) java.base tiered AOT
C) java.base+graal static AOT
D) java.base+graal tiered AOT
E) java.base CDS
F) java.base+graal CDS

Performance observations

• For short-running applications the best combination is C1 + CDS + static AOT.
• We got the best results when only touched methods were AOT compiled.
• With tiered AOT, an application reaches the same peak performance as with the default JIT.
• Static (non-tiered) AOT saves CPU resources by avoiding JIT compilations.
• Code generated by the Graal-core JIT compiler performs only 10% below code generated by C2, depending on the benchmark.
• It seems fine to have Graal-core code compiled as static (non-tiered) AOT.
• CDS improves startup time a lot when many classes are loaded.
• AOT and CDS benefits are incremental: the improvements add up.


JDK 9 AOT limitations

• The initial AOT release in JDK 9 is provided for experimental use only and is restricted to Linux x64 systems running 64-bit Java with either Parallel or G1 GC.
• AOT compilation must be executed on the same system, or a system with the same configuration, on which the AOT code will be used by the Java application.
• The same Java runtime configuration must be used during AOT compilation and execution. A mismatching runtime configuration may cause the application to crash during execution.
• It may not compile Java code that uses dynamically generated classes and bytecode (lambda expressions, invokedynamic).

These limitations may be addressed in future releases.


AOT current work

• We added support for macOS and Windows on x64.
• Removed the dependency on libelf tools by using Java code for AOT object file generation.
• Investigating the combination of AOT and CDS.
• Working on invokedynamic support.
• Optimizing the generated AOT profiling code (for tiered).


Q&A