X10 at Scale
Olivier Tardieu and the X10 Team
http://x10-lang.org
This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
X10 Overview
X10: Productivity and Performance at Scale

>8 years of R&D by IBM Research, supported by DARPA (HPCS/PERCS)

Programming language
§ Bring Java-like productivity to HPC
  § evolution of Java with input from Scala, ZPL, CCP, …
  § imperative OO language, garbage collected, type & memory safe
  § rich data types and type system
§ Design for scale
  § multi-core, multi-processor, distributed, heterogeneous systems
  § few simple constructs for concurrency and distribution

Tool chain
§ Open source compilers, runtime, IDE
§ Debugger (not open source)
X10 in a Nutshell
Asynchronous Partitioned Global Address Space (APGAS)

§ Fine-grained concurrency
  § async S
  § finish S
§ Atomicity within a place
  § when(c) S
  § atomic S
§ Place-shifting operations
  § at(p) S
  § at(p) e
§ Distributed heap
  § GlobalRef[T]
  § PlaceLocalHandle[T]

[Diagram: places 0 through N, each with its own activities and local heap; global references span places.]
APGAS Idioms

§ Remote evaluation

    v = at (p) evalThere(arg1, arg2);

§ Active message

    at (p) async runThere(arg1, arg2);

§ Recursive parallel decomposition

    def fib(n:Int):Int {
        if (n < 2) return n;
        val f1:Int;
        val f2:Int;
        finish {
            async f1 = fib(n-1);
            f2 = fib(n-2);
        }
        return f1 + f2;
    }

§ SPMD

    finish for (p in Place.places()) {
        at (p) async runEverywhere();
    }

§ Atomic remote update

    at (ref) async atomic ref() += v;

§ Data exchange

    // swap local row i with remote row j at place p
    val h = here;
    val row_i = rows()(i);
    finish at (p) async {
        val row_j = rows()(j);
        rows()(j) = row_i;
        at (h) async rows()(i) = row_j;
    }
Scale
What Scale?

X10 has places and asyncs in each place.

We want to
§ handle millions of asyncs (→ billions)
§ handle tens of thousands of places (→ millions)

We need to
§ scale up
  § shared-memory parallelism (today: 32 cores per place)
  § schedule many asyncs with a few hardware threads
§ scale out
  § distributed-memory parallelism (today: 50K places)
  § provide mechanisms for efficient distribution (data & control)
  § support distributed load balancing
Outline
Scale Out Work
§ Experiments
  § DARPA PERCS prototype
  § benchmarks
§ Deep Dive
  § distributed termination detection
  § high-performance interconnects
  § memory management
  § distributed load balancing

Scale Up Work
§ Overview
§ Cilk-style scheduling for X10
Scale Out: Experiments
DARPA PERCS Prototype (Power 775)

§ Compute node
  § 32 Power7 cores at 3.84 GHz
  § 128 GB DRAM
  § peak performance: 982 Gflops
  § Torrent interconnect
§ Drawer
  § 8 nodes
§ Rack
  § 8 to 12 drawers
§ Full prototype
  § up to 1,740 compute nodes
  § up to 55,680 cores
  § up to 1.7 petaflops
  § 1 petaflops with 1,024 compute nodes
Power 775 Drawer

[Diagram: one drawer holds 8 P7 QCMs and 8 hub modules with memory DIMMs (2 x 64), PCIe interconnect, and a water connection; L-Link optical interfaces connect 4 nodes into a supernode, and D-Link optical interfaces and fiber connect to other supernodes. Rack dimensions: 39"W x 72"D x 83"H.]
Eight Benchmarks

§ HPC Challenge benchmarks
  § Linpack: TOP500 (flops)
  § Stream Triad: local memory bandwidth
  § Random Access: distributed memory bandwidth
  § Fast Fourier Transform: mix
§ Machine learning kernels
  § KMEANS: graph clustering
  § SSCA1: pattern matching
  § SSCA2: irregular graph traversal
  § UTS: unbalanced tree traversal

Implemented in X10 as pure scale-out tests
§ one core = one place = one main async
§ native libraries for sequential math kernels: ESSL, FFTW, SHA1
Performance at Scale (Weak Scaling)

Benchmark        Cores    Absolute performance    Parallel efficiency   Performance relative to best
                          at scale                (weak scaling)        implementation available
Stream           55,680   397 TB/s                98%                   85% (lack of prefetching)
FFT              32,768   27 Tflops               93%                   40% (no tuning of seq. code)
Linpack          32,768   589 Tflops              80%                   80% (mix of limitations)
RandomAccess     32,768   843 Gups                100%                  76% (network stack overhead)
KMeans           47,040   depends on parameters   97.8%                 66% (vectorization issue)
SSCA1            47,040   depends on parameters   98.5%                 100%
SSCA2            47,040   245 B edges/s           > 75%                 no comparison data
UTS (geometric)  55,680   596 B nodes/s           98%                   reference code does not scale;
                                                                        4x to 16x faster than UPC code
HPCC Class 2 Competition: Best Performance Award
[Charts: aggregate performance and per-place performance vs. number of places for the five submitted kernels. G-HPL: 589,231 Gflops (22.4 to 18.0 Gflops/place, up to 32,768 places). EP Stream (Triad): 396,614 GB/s (7.23 to 7.12 GB/s/place, up to 55,680 places). G-RandomAccess: 844 Gups (0.82 Gups/place, up to 32,768 places). G-FFT: 26,958 Gflops (0.88 to 0.82 Gflops/place, up to 32,768 places). UTS: 596,451 million nodes/s (10.93 to 10.71 million nodes/s/place, up to 55,680 places). Per-place rates stay nearly flat as the place count grows.]
Scale Out: Deep Dive
Scalable Distributed Termination Detection

§ Distributed termination detection is hard
  § arbitrary message reordering
§ Base algorithm (sketched below)
  § one row of n counters per place, with n places
  § increment on spawn, decrement on termination, message on decrement
  § finish triggered when the sum of each column is zero
§ Optimized algorithms
  § local aggregation and message batching (up to local quiescence)
  § pattern-based specialization
    § local finish, SPMD finish, ping pong, single async
  § software routing
  § uncounted asyncs

runtime optimizations + static analysis + pragmas → scalable finish
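To make the base scheme concrete, here is a minimal illustrative sketch (FinishRow and its method names are invented, not the runtime's code):

    // Each place keeps one row of n counters, one slot per destination place.
    class FinishRow {
        val counts:Rail[Int];
        def this(n:Int) { counts = new Rail[Int](n); }
        def onSpawn(dst:Int) { counts(dst) += 1; }     // an async is spawned for place dst
        def onTerm()         { counts(here.id) -= 1; } // a local async terminates
    }
    // The root gathers one row per place and declares termination when every
    // column sums to zero: each spawn targeting a place has been matched by a
    // termination there, no matter how messages were reordered in flight.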
Scalable Communication: High-Performance Interconnects

§ RDMAs
  § efficient remote memory operations
  § fundamentally asynchronous → good fit for APGAS
  § async semantics (see the usage sketch below):

    Array.asyncCopy[Double](src, srcIndex, dst, dstIndex, size);

§ Collectives
  § multi-point coordination and communication
  § all kinds of restrictions today → poor fit for APGAS today

    Team.WORLD.barrier(here.id);
    columnTeam.addReduce(columnRole, localMax, Team.MAX);

  § bright future (MPI-3 and much more…) → good fit for APGAS
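Because asyncCopy is itself an async in the X10 sense, the governing finish tracks its completion; a minimal usage sketch (src, dst, size, and doLocalWork are illustrative names):

    finish {
        // starts the transfer (an RDMA where available) and returns immediately
        Array.asyncCopy[Double](src, 0, dst, 0, size);
        doLocalWork(); // overlap communication with computation
    }
    // both the copy and the local work are guaranteed complete here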
Row Swap in Linpack
// swap srcRow here with dstRow at place dst!def rowSwap(srcRow:Int, dstRow:Int, dst:Place) {! val srcBuffer = buffers(); // per-place buffer (PlaceLocalHandle)!
val srcBufferRef = new RemoteRail(srcBuffer);! val size = matrix().getRow(srcRow, srcBuffer);! @Pragma(Pragma.FINISH_ASYNC_AND_BACK) finish {! at (dst) async {! @Pragma(Pragma.FINISH_ASYNC) finish {! val dstBuffer = buffers();!
Array.asyncCopy[Double](srcBufferRef, 0, dstBuffer, 0, size);! }! matrix().swapRow(dstRow, dstBuffer);! Array.asyncCopy[Double](dstBuffer, 0, srcBufferRef, 0, size);! } !! }!
matrix().setRow(srcRow, srcBuffer);!}!!
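Note how the pragmas feed the pattern-based specialization from the previous slide: each annotated finish governs a single statically known pattern (a lone async, or an async that comes back), letting the runtime substitute a much cheaper specialized protocol for the general distributed termination detector.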
Scalable Memory Management

§ Garbage collector
  § problem 1: distributed heap
  § solution: segregate local/remote refs → not an issue in practice
    § GC for local refs; distributed GC experiment
  § problem 2: risk of overhead and jitter
  § solution: maximize memory reuse…
§ Congruent memory allocator
  § problem: not all pages are created equal
    § large pages required to minimize TLB misses
    § registered pages required for RDMAs
    § congruent addresses required for RDMAs at scale
  § solution: dedicated memory allocator → issue is contained
    § configurable congruent registered memory region
    § backed by large pages if available
    § only used for performance-critical arrays
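As an illustration, X10 2.x exposed this region through x10.util.IndexedMemoryChunk; the exact signature below is an assumption and should be checked against the release in use:

    import x10.util.IndexedMemoryChunk;

    // Assumed API: a zeroed chunk, 8-byte aligned, allocated in the congruent
    // registered region, so it lands at the same virtual address at every place.
    val table = IndexedMemoryChunk.allocateZeroed[Long](localTableSize, 8, true);

With congruent addresses, a place can compute an RDMA target into a remote array locally, with no per-place base-pointer exchange, which is what makes RDMAs workable at tens of thousands of places.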
Time-Out

Scale Out Work
§ Experiments
  § DARPA PERCS prototype
  § benchmarks
§ Deep Dive
  § distributed termination detection
  § high-performance interconnects
  § memory management
  § distributed load balancing

Scale Up Work
§ Overview
§ Cilk-style scheduling for X10

Implement traditional codes
§ statically scheduled and distributed
§ with less pain
  § type safe & memory safe
  § PGAS
  § higher-level constructs for concurrency and distribution

Implement new kinds of codes
Scalable Global Load Balancing: Unbalanced Tree Search

§ Problem
  § count the nodes of a randomly generated tree
  § separable cryptographic random number generator (sketched below):

    childCount = f(nodeId)
    childId = SHA1(nodeId, childIndex)

  → highly unbalanced trees
  → unpredictable
  → tree traversal can be relocated (no data dependencies, no locality)
§ Strategy
  § dynamic distributed load balancing
    § effectively move work (node ids) from busy nodes to idle nodes
    § deal? steal? startup?
  § effectively detect termination

→ good model for state space exploration problems
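Because a node's children are a pure function of its id, any place can expand any node; a sketch of the expansion step (expand, childCount, and sha1 are hypothetical helpers, not the benchmark's actual names):

    // Expanding a node only needs its id: childCount applies the law f,
    // and each child id is derived by hashing the parent id and child index.
    def expand(nodeId:Rail[Byte], enqueue:(Rail[Byte])=>void) {
        val count = childCount(nodeId);   // f(nodeId)
        for (i in 0..(count-1)) {
            enqueue(sha1(nodeId, i));     // childId = SHA1(nodeId, childIndex)
        }
    }

This separability is exactly what makes relocation free: shipping a node id ships the whole subtree rooted at it.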
Scalable Global Load Balancing: Unbalanced Tree Search

§ Lifeline-based global work stealing [PPoPP'11]
  § n random victims, then p lifelines (hypercube)
    § fixed graph with low degree and low diameter
  § synchronous (steal) then asynchronous (deal)
§ Root finish accounts for
  § startup asyncs + lifeline asyncs
  § not random steal attempts
§ Compact work queue (for shallow trees)
  § represents intervals of siblings
  § thief steals half of each work item
§ Sparse communication graph
  § bounded list of potential random victims
  § finish trades contention for latency

→ genuine APGAS algorithm
UTS Main Loop

    def process() {
        alive = true;
        while (!empty()) {
            while (!empty()) {
                processAtMostN();
                Runtime.probe();
                deal();
            }
            steal();
        }
        alive = false;
    }

    def steal() {
        val h = here.id;
        for (i:Int in 0..w) {
            if (!empty()) break;
            finish at (Place(victims(rnd.nextInt(m)))) async request(h, false);
        }
        for (lifeline:Int in lifelines) {
            if (!empty()) break;
            if (!lifelinesActivated(lifeline)) {
                lifelinesActivated(lifeline) = true;
                finish at (Place(lifeline)) async request(h, true);
            }
        }
    }
UTS Handling Thieves

    def request(thief:Int, lifeline:Boolean) {
        val nodes = take(); // grab nodes from the local queue
        if (nodes == null) {
            if (lifeline) lifelineThieves.push(thief);
            return;
        }
        at (Place(thief)) async {
            if (lifeline) lifelinesActivated(thief) = false;
            enqueue(nodes); // add nodes to the local queue
        }
    }

    def deal() {
        while (!lifelineThieves.empty()) {
            val nodes = take(); // grab nodes from the local queue
            if (nodes == null) return;
            val thief = lifelineThieves.pop();
            at (Place(thief)) async {
                lifelinesActivated(thief) = false;
                enqueue(nodes); // add nodes to the local queue
                if (!alive) process();
            }
        }
    }
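Note the division of labor: a failed steal returns quickly (the finish at the victim only records the thief on lifelineThieves), while deal() later pushes work back to registered lifeline thieves and, via if (!alive) process(), restarts the processing loop of a place that had gone idle.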
From Theory to Practice

From 256 cores in January 2011, to 7,936 in March 2012, to 47,040 in July 2012. Delivered in August 2012.

→ good abstractions
Scale Up
Local Load Balancing

§ Problem: many more asyncs than execution units (hardware threads)
§ Solution: cooperative work-stealing scheduling (see the sketch below)
  § pool of worker threads
  § per-worker deque (double-ended queue) of pending jobs
  § an idle worker pops a job from its own deque, or steals from another worker if its deque is empty
§ Question 1: what is a job? Given async p; q…
  § obvious candidate: the async body p (→ fork/join-style scheduler)
  § alternative: the continuation q… (→ Cilk-style scheduler)
§ Question 2: what if a job has to wait (IO, RPC…)?
  § if possible: execute a prerequisite job while waiting
  § if not, either wait (but spawn a new thread to maintain core utilization)
  § or save the execution state and reuse the thread
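A deliberately simplified, single-threaded sketch of this discipline (Worker and its toy fixed-capacity deque are invented for illustration; the real scheduler synchronizes deque access):

    class Worker {
        val jobs = new Rail[()=>void](1024); // toy fixed-capacity deque
        var top:Int = 0;                     // thieves steal at the top (oldest jobs)
        var bottom:Int = 0;                  // the owner pushes/pops at the bottom (newest)
        def push(job:()=>void) { jobs(bottom++) = job; }
        def popBottom():()=>void {
            if (bottom == top) return null;
            return jobs(--bottom);
        }
        def stealTop():()=>void {
            if (bottom == top) return null;
            return jobs(top++);
        }
        def loop(victim:Worker) {
            while (true) {
                var job:()=>void = popBottom();           // own deque first
                if (job == null) job = victim.stealTop(); // idle: steal the oldest job
                if (job == null) break;                   // no work anywhere
                job();                                    // run the job
            }
        }
    }

Owners working LIFO preserves locality, while thieves taking FIFO tend to grab the largest pending subcomputations.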
Fibonacci

    static def fib(n:Int) {
        if (n <= 1) return n;

        val x:Int;
        val y:Int;

        finish {
            async x = fib(n-2);
            y = fib(n-1);
        }
        return x + y;
    }
[Diagram: worker deques while computing fib(10): frames for fib(10), fib(9), fib(8), … interleaved with finish and async markers, showing the continuations available for stealing.]
Cilk-Style Scheduling for X10

§ X10-source-to-X10-source program transformation
  § exposes the "compiler stack"
  § transforms methods into frame classes (illustrated below)
  § transforms local variables into frame fields
  § rewrites method invocations to save the parent frame pointer and pc
  § rewrites methods to permit reentry (resume execution at a specified instruction)
  § rewrites X10 concurrency constructs into invocations of the runtime scheduler
§ X10 runtime scheduler in X10
  § can walk the stack, save the stack, restore the stack
  § can resume execution at a saved program point with the restored stack
§ Hacks
  § permit uninitialized fields
  § speculative stack allocation
Micro Benchmarks
[Charts: speedup on 1 to 16 workers for Cilk++, Java fork/join (JavaFJ), and X10 work stealing (X10WS) on Fib, Integrate, and QuickSort; execution time in seconds with 16 threads and sequential overhead for Fib(40), Integrate(1536), and QuickSort(10M).]
X10 vs. Cilk++: PBBS Benchmarks
http://www.cs.cmu.edu/~guyb/pbbs/

§ Compared running time (s) with 16 cores

[Chart: running time of Cilk++ vs. X10WS with 16 cores on the PBBS benchmarks (MI, MM.rp, MM.mm, MM.mm2d, CG, BFS.nd, BFS.d, the RS.* variants, and the SQ.* variants).]
X10 vs. Cilk++: PBBS Benchmarks
http://www.cs.cmu.edu/~guyb/pbbs/
§ Compared sequential overhead
[Chart: sequential overhead of Cilk++ vs. X10WS on the PBBS SQ.* benchmark variants.]
Wrap-Up
Conclusions

§ X10 can scale to petaflop systems
§ X10 can ease the development of scalable codes
  § for traditional statically scheduled and distributed codes
  § for novel classes of codes (APGAS)

Future Work
§ Tightly integrate the scale-out and scale-up work
§ Continue and generalize the work on distributed termination
§ Systematize distributed load balancing
§ Develop APGAS beyond X10
  § X10-runtime-as-a-library project
References
§ Main X10 website: http://x10-lang.org
§ "A Brief Introduction to X10 (for the HPC Programmer)": http://x10.sourceforge.net/documentation/intro/intro-223.pdf
§ X10 2012 HPC Challenge submission: http://hpcchallenge.org, http://x10.sourceforge.net/documentation/hpcc/x10-hpcc2012-paper.pdf
§ Unbalanced Tree Search in X10 [PPoPP'11]: http://dl.acm.org/citation.cfm?id=1941582
§ Cilk-style scheduling for X10: http://dl.acm.org/citation.cfm?id=2145850