Java in High-Performance Computing
Dawid Weiss
Carrot Search
Institute of Computing Science, Poznan University of Technology
GeeCon Poznan, 05/2010
Learn from the mistakes of others. You can’t live long enough to make them all yourself.
— Eleanor Roosevelt
Talk outline
• What is “High performance”?
• What is “Java”?
• Measuring performance (benchmarking).
• HPPC library.
Crosscutting: (un?)common pitfalls and performance killers. Some HotSpot internals.
Divide-and-conquer style algorithm
for (Example e : examples) {
  e.hasQuiz() ? e.showQuiz() : e.showCode();
  e.explain();
  e.deriveConclusions();
}
— PART I —
High Performance Computing
High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.
— Wikipedia
Is Java faster than C/C++?
The short answer is: it depends.
— Cliff Click
It’s usually hard to make a fast program run faster.
It’s easy to make a slow program run even slower.
It’s easy to make fast hardware run slow.
For now, HPC
• limited allowed computation time,
• constrained resources (hardware, memory).
Good HPC software ∝ no (obvious) flaws.
— PART II —
What is Java?
(Recall: Is Java faster than C/C++?)
Example 1
public void testSum1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
  result = sum;
}

public void testSum2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum2(i, i);
  result = sum;
}
where the body of sum1 and sum2 sums arguments and returns the result and COUNT is significantly large...
VM                sum1   sum2
sun-1.6.0-20      0.04   2.62
sun-1.6.0-16      0.04   3.20
sun-1.5.0-18      0.04   3.29
ibm-1.6.2         0.08   6.28
jrockit-27.5.0    0.18   0.16
harmony-r917296   0.17   0.35
(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).
VM                sum1   sum2   sum3   sum4
sun-1.6.0-20      0.04   2.62   1.05    3.76
sun-1.6.0-16      0.04   3.20   1.39    4.99
sun-1.5.0-18      0.04   3.29   1.46    5.20
ibm-1.6.2         0.08   6.28   0.16   14.64
jrockit-27.5.0    0.18   0.16   1.16    3.18
harmony-r917296   0.17   0.35   9.18   22.49
(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).
int sum1(int a, int b) {
  return a + b;
}

Integer sum2(Integer a, Integer b) {
  return a + b;
}
↓
Integer sum2(Integer a, Integer b) {
  return Integer.valueOf(
    a.intValue() + b.intValue());
}

int sum3(int... args) {
  int sum = 0;
  for (int i = 0; i < args.length; i++)
    sum += args[i];
  return sum;
}

Integer sum4(Integer... args) {
  int sum = 0;
  for (int i = 0; i < args.length; i++) {
    sum += args[i];
  }
  return sum;
}
↓
Integer sum4(Integer [] args) {
  // ...
}
Conclusions
• Syntactic sugar may be costly.
• Primitive types are fast.
• Large differences between different VMs.
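To make the "syntactic sugar" point concrete, here is a minimal sketch of the boxed varargs call from Example 1 next to a hand-desugared version. The class name BoxingSketch and the method sum4Desugared are hypothetical and only illustrate what the compiler inserts; this is not the talk's benchmark code.

```java
final class BoxingSketch {
    // Same shape as sum4 on the slides: boxed varargs.
    static Integer sum4(Integer... args) {
        int sum = 0;
        for (int i = 0; i < args.length; i++) {
            sum += args[i];           // implicit args[i].intValue()
        }
        return sum;                   // implicit Integer.valueOf(sum)
    }

    // Hand-written equivalent with every conversion spelled out
    // (hypothetical name, for illustration only).
    static Integer sum4Desugared(Integer[] args) {
        int sum = 0;
        for (int i = 0; i < args.length; i++) {
            sum += args[i].intValue();
        }
        return Integer.valueOf(sum);
    }

    public static void main(String[] unused) {
        // The sugared call site hides an array allocation and three boxings.
        Integer sugared = sum4(1, 2, 3);
        Integer explicit = sum4Desugared(new Integer[] {
            Integer.valueOf(1), Integer.valueOf(2), Integer.valueOf(3)
        });
        System.out.println(sugared + " " + explicit);
    }
}
```

Each call to sum4 at the source level therefore involves an array allocation plus one boxing per argument and one for the result, which is the work the sum1 variant avoids entirely.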
Example 2
Write once, run anywhere!
But it’s the same VM!
It works on my machine!
private static boolean ready;

public static void startThread() {
  new Thread() {
    public void run() {
      try {
        sleep(2000);
      } catch (Exception e) { /* ignore */ }
      System.out.println("Marking loop exit.");
      ready = true;
    }
  }.start();
}

public static void main(String[] args) {
  startThread();
  System.out.println("Entering the loop...");
  while (!ready) {
    // Do nothing.
  }
  System.out.println("Done, I left the loop!");
}
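while (!ready) {
  // Do nothing.
}
≡?
boolean r = ready;
while (!r) {
  // Do nothing.
}
In most cases true, from a JMM perspective: ready is not volatile and the loop has no synchronization, so the compiler may read the field once and hoist the read out of the loop.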
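One common way to rule out this transformation is to declare the flag volatile. The following is a minimal sketch under that assumption; the class name VolatileFlag and the method waitForFlag are hypothetical, and the code is a variation on Example 2 rather than something shown in the talk.

```java
final class VolatileFlag {
    // volatile: each loop iteration must re-read the field, so the write
    // made by the other thread is eventually observed.
    private static volatile boolean ready;

    /** Starts a thread that sets the flag after delayMillis, then spins until the flag is seen. */
    static boolean waitForFlag(final long delayMillis) {
        ready = false;
        Thread setter = new Thread() {
            @Override
            public void run() {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                ready = true;
            }
        };
        setter.start();
        while (!ready) {
            // Do nothing; the volatile read cannot be hoisted out of the loop.
        }
        return ready;
    }

    public static void main(String[] args) {
        System.out.println("Entering the loop...");
        waitForFlag(2000);
        System.out.println("Done, I left the loop!");
    }
}
```

A busy-wait loop is kept here only to mirror the original example; in real code a blocking primitive such as java.util.concurrent.CountDownLatch would usually be a better fit.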
JVM Internals. . .
C1:
• fast
• not (much) optimization
C2:
• slow(er) than C1
• a lot of JMM-allowed optimizations
There are hundreds of JVM tuning/diagnostic switches.
My personal favorite:
Conclusions
• Bytecode is far from what is executed.
• A lot going on under the (VM) hood.
• Bad code may work, but will eventually crash.
• HotSpot-level optimizations are good.
• If there is a bug in the HotSpot compiler. . .
Any other diversifying factors?
J2ME
• more VM vendors,
• hardware diversity,
• software and hardware quirks.
Non-JVM target platforms
• Dalvik
• GWT
• IKVM
Conclusions
• There is no “single” Java performance model.
• Performance depends on the VM, environment, class library, hardware.
• Apply benchmark-and-correct cycle.
Benchmarking
Example 3
public void testSum1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
  result = sum;
}

public void testSum1_2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
}
VM                sum1   sum1_2
sun-1.6.0-20      0.04   0.00
sun-1.6.0-16      0.04   0.00
sun-1.5.0-18      0.04   0.00
ibm-1.6.2         0.08   0.01
jrockit-27.5.0    0.17   0.08
harmony-r917296   0.17   0.11
(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).
java -server -XX:+PrintOptoAssembly -XX:+PrintCompilation ...

 - method holder: 'com/dawidweiss/geecon2010/Example03'
 - access: 0xc1000001 public
 - name: 'testSum1_2'
...
010   pushq   rbp
      subq    rsp, #16    # Create frame
      nop                 # nop for patch_verified_entry
016   addq    rsp, 16     # Destroy frame
      popq    rbp
      testl   rax, [rip + #offset_to_poll_page]   # Safepoint: poll for GC
021   ret
Conclusions
• Benchmarks must be executed to provide feedback.
• HotSpot is smart and effective at removing dead code.
Example 4
@Test
public void testAdd1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++) {
    sum += add1(i);
  }
  guard = sum;
}

public int add1(int i) {
  return i + 1;
}
Note add1 is virtual.
switch                               testAdd1
-XX:+Inlining -XX:+PrintInlining     0.04
-XX:-Inlining                        0.45
(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200, JRE 1.7b80-debug).
Most Java calls are monomorphic.
HotSpot adjusts to megamorphic calls automatically.
Example 5
abstract class Superclass {
  abstract int call();
}

class Sub1 extends Superclass { int call() { return 1; } }
class Sub2 extends Superclass { int call() { return 2; } }
class Sub3 extends Superclass { int call() { return 3; } }

Superclass[] mixed = initWithRandomInstances(10000);
Superclass[] solid = initWithSub1Instances(10000);

@Test
public void testMonomorphic() {
  int sum = 0;
  int m = solid.length;
  for (int i = 0; i < COUNT; i++)
    sum += solid[i % m].call();
  guard = sum;
}

@Test
public void testMegamorphic() {
  int sum = 0;
  int m = mixed.length;
  for (int i = 0; i < COUNT; i++)
    sum += mixed[i % m].call();
  guard = sum;
}
VM                monomorphic   megamorphic
sun-1.6.0-20      0.19          0.32
sun-1.6.0-16      0.19          0.34
sun-1.5.0-18      0.18          0.34
ibm-1.6.2         0.20          0.30
jrockit-27.5.0    0.22          0.29
harmony-r917296   0.27          0.32
(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).
Example 6
@Test
public void testBitCount1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += Integer.bitCount(i);
  guard = sum;
}

@Test
public void testBitCount2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += bitCount(i);
  guard = sum;
}

/* Copied from
 * {@link Integer#bitCount}
 */
static int bitCount(int i) {
  // HD, Figure 5-2
  i = i - ((i >>> 1) & 0x55555555);
  i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
  i = (i + (i >>> 4)) & 0x0f0f0f0f;
  i = i + (i >>> 8);
  i = i + (i >>> 16);
  return i & 0x3f;
}
VM              testBitCount1   testBitCount2
sun-1.6.0-20    0.43            0.43
sun-1.7.0-b80   0.43            0.43
(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

VM              testBitCount1   testBitCount2
sun-1.6.0-20    0.08            0.33
sun-1.7.0-b83   0.07            0.32
(averages in sec., 10 measured rounds, 5 warmup, 64-bit Windows 7, Intel I7 860).
... -XX:+PrintInlining ...

...
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Example06.testBitCount1: [measured 10 out of 15 rounds]
round: 0.07 [+- 0.00], round.gc: 0.00 [+- 0.00] ...

@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
Example06.testBitCount2: [measured 10 out of 15 rounds]
round: 0.32 [+- 0.01], round.gc: 0.00 [+- 0.00] ...
... -XX:+PrintOptoAssembly ...

{method}
 - klass: {other class}
 - method holder: com/dawidweiss/geecon2010/Example06
 - name: testBitCount1
...
0c2   B13: # B12 B14 <- B8 B12 Loop: B13-B12 inner stride: ...
0c2   movl    R10, RDX             # spill
...
0e1   movl    [rsp + #40], R11     # spill
0e6   popcnt  R8, R8
...
0f5   addl    R9, #7               # int
0f9   popcnt  R11, R11
0fe   popcnt  RCX, R9
Conclusions
• Benchmarks must be statistically sound.
  → averages, variance, min, max, warm-up phase
• Account for HotSpot optimisations.
• Account for hardware differences.
  → test-on-target
• Use domain data and real scenarios.
• Inspect suspicious output with debug JVM.
See more: Cliff Click, http://java.sun.com/javaone/2009/articles/rockstar_click.jsp.
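The checklist above (warm-up phase, averages, variance, min, max, and a guard against dead-code removal) can be sketched as a tiny hand-rolled harness. Everything here is an assumption made for illustration: the class MiniBench, its measure method, and the workload are hypothetical and are not the junit-benchmarks API used for the numbers in this talk.

```java
final class MiniBench {
    // Sink field: publishing the result keeps the JIT from discarding the loop.
    static volatile int guard;

    static int sum1(int a, int b) {
        return a + b;
    }

    // Hypothetical workload, shaped like testSum1 from the examples.
    static void workload() {
        int sum = 0;
        for (int i = 0; i < 1000000; i++) {
            sum += sum1(i, i);
        }
        guard = sum;
    }

    /** Runs warm-up rounds, then returns {average, stddev, min, max} in seconds. */
    static double[] measure(Runnable task, int warmup, int rounds) {
        for (int i = 0; i < warmup; i++) {
            task.run();                       // let the JIT compile the hot code first
        }
        double[] times = new double[rounds];
        double total = 0, min = Double.MAX_VALUE, max = 0;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            times[i] = (System.nanoTime() - start) / 1e9;
            total += times[i];
            min = Math.min(min, times[i]);
            max = Math.max(max, times[i]);
        }
        double avg = total / rounds;
        double squares = 0;
        for (double t : times) {
            squares += (t - avg) * (t - avg);
        }
        return new double[] { avg, Math.sqrt(squares / rounds), min, max };
    }

    public static void main(String[] args) {
        double[] r = measure(new Runnable() {
            public void run() {
                workload();
            }
        }, 5, 10);
        System.out.println(String.format(
            "avg: %.4f [+- %.4f], min: %.4f, max: %.4f", r[0], r[1], r[2], r[3]));
    }
}
```

The round counts (5 warm-up, 10 measured) mirror the setup quoted under the tables; a sketch like this still ignores GC activity and on-stack replacement effects, so suspicious numbers should be checked against the compiler output as described above.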
HPPC
High Performance Primitive Collections
Motivation
• Primitive types: fast and memory-friendly.
• Optional assertions.
• Single-threaded. No fail-fast.
• Fast, fast, fast iterators, with no GC overhead.
• Open internals (explicit implementation).
• Programmers know what they’re doing.
Why not JCF?
public interface List<E> extends Collection<E> {
  boolean contains(Object o);   // [-] contract-enforced methods
  Iterator<E> iterator();       // [-] iterators over primitive types?
  Object[] toArray();           // [-] troublesome covariants
  ...
Friendly Competition
• fastutil
• PCJ
• GNU Trove
• Apache Mahout (ported COLT)
• Apache Primitive Collections
All of these have pros and cons and deal with JCF compatibility somehow.
Iterators in fastutil or PCJ
interface IntIterator extends Iterator<Integer> {
  // Primitive-specific method
  int nextInt();
}
Iterators in HPPC
public final class IntCursor {
  public int index;
  public int value;
}

public class IntArrayList implements Iterable<IntCursor> {
  public Iterator<IntCursor> iterator() { ... }
}
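The cursor idea can be sketched without HPPC on the classpath. The code below is an assumption-laden illustration, not HPPC's implementation: the names CursorSketch and TinyIntList are hypothetical, and the only point is that the iterator hands back the same mutable cursor object on every step, so iterating allocates nothing per element and never boxes.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a cursor: a mutable (index, value) pair.
final class CursorSketch {
    public int index;
    public int value;
}

// Hypothetical fixed-size int list exposing a cursor-based iterator.
final class TinyIntList implements Iterable<CursorSketch> {
    final int[] buffer;
    final int size;

    TinyIntList(int... values) {
        this.buffer = values.clone();
        this.size = values.length;
    }

    @Override
    public Iterator<CursorSketch> iterator() {
        return new Iterator<CursorSketch>() {
            private final CursorSketch cursor = new CursorSketch(); // allocated once
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < size;
            }

            @Override
            public CursorSketch next() {
                if (next >= size) {
                    throw new NoSuchElementException();
                }
                cursor.index = next;
                cursor.value = buffer[next];
                next++;
                return cursor; // the same object is returned on every call
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    public static void main(String[] args) {
        for (CursorSketch c : new TinyIntList(3, 4, 5)) {
            System.out.println(c.index + ": " + c.value);
        }
    }
}
```

The trade-off of this design is that a cursor must not be stored across iterations, because its fields are overwritten by the next call to next().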
Iterating over list elements in HPPC
for (IntCursor c : list) {
  System.out.println(c.index + ": " + c.value);
}
...or
list.forEach(new IntProcedure() {
  public void apply(int value) {
    System.out.println(value);
  }
});
...or
final int [] buffer = list.buffer;
final int size = list.size();
for (int i = 0; i < size; i++) {
  System.out.println(i + ": " + buffer[i]);
}
The fastest one?
What’s in HPPC?
Open implementation is good.
/**
 * Applies a supplemental hash function to a given
 * hashCode, which defends against poor quality
 * hash functions. [...]
 */
static int hash(int h) {
  // This function ensures that hashCodes that differ only by
  // constant multiples at each bit position have a bounded
  // number of collisions (approximately 8 at default load factor).
  h ^= (h >>> 20) ^ (h >>> 12);
  return h ^ (h >>> 7) ^ (h >>> 4);
}
HashMap rehashes your (carefully crafted) hash code.
HPPC approach (example):
public class LongIntOpenHashMap implements LongIntMap {
  // ...
  public LongIntOpenHashMap(int initialCapacity, float loadFactor,
      LongHashFunction keyHashFunction, IntHashFunction valueHashFunction) {
    // ...
  }
Defaults: LongMurmurHash, IntHashFunction.
Example 7
Frequency count of character bigrams in a given text.
• HPPC:
final char [] CHARS = DATA;
final IntIntOpenHashMap counts = new IntIntOpenHashMap();
for (int i = 0; i < CHARS.length - 1; i++) {
  counts.putOrAdd((CHARS[i] << 16 | CHARS[i + 1]), 1, 1);
}
• JCF, boxed integer types.
final Integer currentCount = map.get(bigram);
map.put(bigram, currentCount == null ? 1 : currentCount + 1);
• JCF, with IntHolder (mutable value object).
• GNU Trove
map.adjustOrPutValue(bigram, 1, 1);
• fastutil, OpenHashMap and LinkedOpenHashMap
map.put(bigram, map.get(bigram) + 1);
• PCJ, OpenHashMap and ChainedHashMap
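For reference, the two JCF variants listed above can be written out in full using only java.util. This is a sketch under assumptions: the class BigramCounts, its method names, and the shape of IntHolder are hypothetical (the talk names IntHolder but does not show it), and the key packing follows the HPPC snippet.

```java
import java.util.HashMap;
import java.util.Map;

final class BigramCounts {
    // Assumed shape of the mutable value object mentioned in the list above.
    static final class IntHolder {
        int value;
    }

    // Packs two chars into one int key, as in the HPPC snippet.
    static int key(char first, char second) {
        return (first << 16) | second;
    }

    // JCF with boxed integers: every update boxes the key and the new count.
    static Map<Integer, Integer> countBoxed(char[] chars) {
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        for (int i = 0; i < chars.length - 1; i++) {
            Integer bigram = key(chars[i], chars[i + 1]);
            Integer currentCount = map.get(bigram);
            map.put(bigram, currentCount == null ? 1 : currentCount + 1);
        }
        return map;
    }

    // JCF with a mutable holder: one allocation per distinct bigram,
    // then in-place increments with a single lookup per update.
    static Map<Integer, IntHolder> countWithHolder(char[] chars) {
        Map<Integer, IntHolder> map = new HashMap<Integer, IntHolder>();
        for (int i = 0; i < chars.length - 1; i++) {
            Integer bigram = key(chars[i], chars[i + 1]);
            IntHolder holder = map.get(bigram);
            if (holder == null) {
                holder = new IntHolder();
                map.put(bigram, holder);
            }
            holder.value++;
        }
        return map;
    }

    public static void main(String[] args) {
        char[] data = "abab".toCharArray();
        System.out.println(countBoxed(data).get(key('a', 'b')));
        System.out.println(countWithHolder(data).get(key('a', 'b')).value);
    }
}
```

Both variants still box the key on every lookup; the holder version only removes the boxing of the count and the second map operation on the update path.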
Is Java faster than C/C++?
The short answer is: it depends.
— Cliff Click
Example 8
The same algorithm for building a DFSA automaton accepting a set of strings. Input: 3 565 575 strings, 158M of text.
        gcc -O2    java 1.6.0_20-64
real    63.850s    43.197s
user    63.110s    46.370s
sys      0.240s     0.840s
Summary and Conclusions
Performance checklist (sanity check)
• Algorithms, algorithms, algorithms.
• Proper data structures.
• Spurious GC activity.
• Memory barriers in tight loops.
• CPU cache utilization.
• Low-level, hotspot-specific code structuring.
HPPC and junit-benchmarks are at: http://labs.carrotsearch.com