X10 at Scale
Olivier Tardieu and the X10 Team
http://x10-lang.org
This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
X10 Overview
X10: Productivity and Performance at Scale

>8 years of R&D by IBM Research, supported by DARPA (HPCS/PERCS)

Programming language
§ Bring Java-like productivity to HPC
  § evolution of Java with input from Scala, ZPL, CCP, …
  § imperative OO language, garbage collected, type & memory safe
  § rich data types and type system
§ Design for scale
  § multi-core, multi-processor, distributed, heterogeneous systems
  § few simple constructs for concurrency and distribution

Tool chain
§ Open source compilers, runtime, IDE
§ Debugger (not open source)
X10 in a Nutshell
Asynchronous Partitioned Global Address Space (APGAS)

§ Fine-grained concurrency
  § async S
  § finish S
§ Atomicity within a place
  § when(c) S
  § atomic S
§ Place-shifting operations
  § at(p) S
  § at(p) e
§ Distributed heap
  § GlobalRef[T]
  § PlaceLocalHandle[T]

[Diagram: places 0 through N, each with its own activities and local heap; global references span places.]
APGAS Idioms

§ Remote evaluation

    v = at (p) evalThere(arg1, arg2);

§ Active message

    at (p) async runThere(arg1, arg2);

§ Recursive parallel decomposition

    def fib(n:Int):Int {
        if (n < 2) return n;
        val f1:Int;
        val f2:Int;
        finish {
            async f1 = fib(n-1);
            f2 = fib(n-2);
        }
        return f1 + f2;
    }

§ SPMD

    finish for (p in Place.places()) {
        at (p) async runEverywhere();
    }

§ Atomic remote update

    at (ref) async atomic ref() += v;

§ Data exchange

    // swap local row i with remote row j at place p
    val h = here;
    val row_i = rows()(i);
    finish at (p) async {
        val row_j = rows()(j);
        rows()(j) = row_i;
        at (h) async rows()(i) = row_j;
    }
Scale
What Scale?

X10 has places and asyncs in each place.

We want to
§ handle millions of asyncs (→ billions)
§ handle tens of thousands of places (→ millions)

We need to
§ scale up
  § shared-memory parallelism (today: 32 cores per place)
  § schedule many asyncs with a few hardware threads
§ scale out
  § distributed-memory parallelism (today: 50K places)
  § provide mechanisms for efficient distribution (data & control)
  § support distributed load balancing
Outline
Scale Out Work
§ Experiments
  § DARPA PERCS prototype
  § benchmarks
§ Deep Dive
  § distributed termination detection
  § high-performance interconnects
  § memory management
  § distributed load balancing

Scale Up Work
§ Overview
§ Cilk-style scheduling for X10
Scale Out: Experiments
DARPA PERCS Prototype (Power 775)

§ Compute node
  § 32 Power7 cores at 3.84 GHz
  § 128 GB DRAM
  § peak performance: 982 Gflops
  § Torrent interconnect
§ Drawer
  § 8 nodes
§ Rack
  § 8 to 12 drawers
§ Full prototype
  § up to 1,740 compute nodes
  § up to 55,680 cores
  § up to 1.7 petaflops
  § 1 petaflops with 1,024 compute nodes
Power 775 Drawer

[Diagram: one drawer holds 8 P7 QCMs and 8 hub modules with memory DIMMs (2 x 64), PCIe interconnect, and a water connection; L-Link optical interfaces connect 4 nodes into a supernode, and D-Link optical interfaces and fiber connect to other supernodes. Rack dimensions: 39"W x 72"D x 83"H.]
Eight Benchmarks

§ HPC Challenge benchmarks
  § Linpack: TOP500 (flops)
  § Stream Triad: local memory bandwidth
  § Random Access: distributed memory bandwidth
  § Fast Fourier Transform: mix
§ Machine learning kernels
  § KMEANS: graph clustering
  § SSCA1: pattern matching
  § SSCA2: irregular graph traversal
  § UTS: unbalanced tree traversal

Implemented in X10 as pure scale-out tests
§ one core = one place = one main async
§ native libraries for sequential math kernels: ESSL, FFTW, SHA1
Performance at Scale (Weak Scaling)

Benchmark        Cores    Absolute performance    Parallel efficiency   Performance relative to best
                          at scale                (weak scaling)        implementation available
Stream           55,680   397 TB/s                98%                   85% (lack of prefetching)
FFT              32,768   27 Tflops               93%                   40% (no tuning of seq. code)
Linpack          32,768   589 Tflops              80%                   80% (mix of limitations)
RandomAccess     32,768   843 Gups                100%                  76% (network stack overhead)
KMeans           47,040   depends on parameters   97.8%                 66% (vectorization issue)
SSCA1            47,040   depends on parameters   98.5%                 100%
SSCA2            47,040   245 B edges/s           > 75%                 no comparison data
UTS (geometric)  55,680   596 B nodes/s           98%                   reference code does not scale;
                                                                        4x to 16x faster than UPC code
HPCC Class 2 Competition: Best Performance Award
[Charts: aggregate performance and per-place performance vs. number of places for the five submitted kernels. G-HPL: 589,231 Gflops (22.4 to 18.0 Gflops/place, up to 32,768 places). EP Stream (Triad): 396,614 GB/s (7.23 to 7.12 GB/s/place, up to 55,680 places). G-RandomAccess: 844 Gups (0.82 Gups/place, up to 32,768 places). G-FFT: 26,958 Gflops (0.88 to 0.82 Gflops/place, up to 32,768 places). UTS: 596,451 million nodes/s (10.93 to 10.71 million nodes/s/place, up to 55,680 places). Per-place rates stay nearly flat as the place count grows.]
Scale Out: Deep Dive
Scalable Distributed Termination Detection

§ Distributed termination detection is hard
  § arbitrary message reordering
§ Base algorithm (sketched below)
  § one row of n counters per place, with n places
  § increment on spawn, decrement on termination, message on decrement
  § finish triggered when the sum of each column is zero
§ Optimized algorithms
  § local aggregation and message batching (up to local quiescence)
  § pattern-based specialization
    § local finish, SPMD finish, ping pong, single async
  § software routing
  § uncounted asyncs

runtime optimizations + static analysis + pragmas → scalable finish
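To make the base scheme concrete, here is a minimal illustrative sketch (FinishRow and its method names are invented, not the runtime's code):

    // Each place keeps one row of n counters, one slot per destination place.
    class FinishRow {
        val counts:Rail[Int];
        def this(n:Int) { counts = new Rail[Int](n); }
        def onSpawn(dst:Int) { counts(dst) += 1; }     // an async is spawned for place dst
        def onTerm()         { counts(here.id) -= 1; } // a local async terminates
    }
    // The root gathers one row per place and declares termination when every
    // column sums to zero: each spawn targeting a place has been matched by a
    // termination there, no matter how messages were reordered in flight.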
Scalable Communication: High-Performance Interconnects

§ RDMAs
  § efficient remote memory operations
  § fundamentally asynchronous → good fit for APGAS
  § async semantics (see the usage sketch below):

    Array.asyncCopy[Double](src, srcIndex, dst, dstIndex, size);

§ Collectives
  § multi-point coordination and communication
  § all kinds of restrictions today → poor fit for APGAS today

    Team.WORLD.barrier(here.id);
    columnTeam.addReduce(columnRole, localMax, Team.MAX);

  § bright future (MPI-3 and much more…) → good fit for APGAS
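Because asyncCopy is itself an async in the X10 sense, the governing finish tracks its completion; a minimal usage sketch (src, dst, size, and doLocalWork are illustrative names):

    finish {
        // starts the transfer (an RDMA where available) and returns immediately
        Array.asyncCopy[Double](src, 0, dst, 0, size);
        doLocalWork(); // overlap communication with computation
    }
    // both the copy and the local work are guaranteed complete here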
Row Swap in Linpack
// swap srcRow here with dstRow at place dst!def rowSwap(srcRow:Int, dstRow:Int, dst:Place) {! val srcBuffer = buffers(); // per-place buffer (PlaceLocalHandle)!
val srcBufferRef = new RemoteRail(srcBuffer);! val size = matrix().getRow(srcRow, srcBuffer);! @Pragma(Pragma.FINISH_ASYNC_AND_BACK) finish {! at (dst) async {! @Pragma(Pragma.FINISH_ASYNC) finish {! val dstBuffer = buffers();!
Array.asyncCopy[Double](srcBufferRef, 0, dstBuffer, 0, size);! }! matrix().swapRow(dstRow, dstBuffer);! Array.asyncCopy[Double](dstBuffer, 0, srcBufferRef, 0, size);! } !! }!
matrix().setRow(srcRow, srcBuffer);!}!!
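Note how the pragmas feed the pattern-based specialization from the previous slide: each annotated finish governs a single statically known pattern (a lone async, or an async that comes back), letting the runtime substitute a much cheaper specialized protocol for the general distributed termination detector.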
Scalable Memory Management

§ Garbage collector
  § problem 1: distributed heap
  § solution: segregate local/remote refs → not an issue in practice
    § GC for local refs; distributed GC experiment
  § problem 2: risk of overhead and jitter
  § solution: maximize memory reuse…
§ Congruent memory allocator
  § problem: not all pages are created equal
    § large pages required to minimize TLB misses
    § registered pages required for RDMAs
    § congruent addresses required for RDMAs at scale
  § solution: dedicated memory allocator → issue is contained
    § configurable congruent registered memory region
    § backed by large pages if available
    § only used for performance-critical arrays
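As an illustration, X10 2.x exposed this region through x10.util.IndexedMemoryChunk; the exact signature below is an assumption and should be checked against the release in use:

    import x10.util.IndexedMemoryChunk;

    // Assumed API: a zeroed chunk, 8-byte aligned, allocated in the congruent
    // registered region, so it lands at the same virtual address at every place.
    val table = IndexedMemoryChunk.allocateZeroed[Long](localTableSize, 8, true);

With congruent addresses, a place can compute an RDMA target into a remote array locally, with no per-place base-pointer exchange, which is what makes RDMAs workable at tens of thousands of places.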
Time-Out

Scale Out Work
§ Experiments
  § DARPA PERCS prototype
  § benchmarks
§ Deep Dive
  § distributed termination detection
  § high-performance interconnects
  § memory management
  § distributed load balancing

Scale Up Work
§ Overview
§ Cilk-style scheduling for X10

Implement traditional codes
§ statically scheduled and distributed
§ with less pain
  § type safe & memory safe
  § PGAS
  § higher-level constructs for concurrency and distribution

Implement new kinds of codes
Scalable Global Load Balancing: Unbalanced Tree Search

§ Problem
  § count the nodes of a randomly generated tree
  § separable cryptographic random number generator (sketched below):

    childCount = f(nodeId)
    childId = SHA1(nodeId, childIndex)

  → highly unbalanced trees
  → unpredictable
  → tree traversal can be relocated (no data dependencies, no locality)
§ Strategy
  § dynamic distributed load balancing
    § effectively move work (node ids) from busy nodes to idle nodes
    § deal? steal? startup?
  § effectively detect termination

→ good model for state space exploration problems
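Because a node's children are a pure function of its id, any place can expand any node; a sketch of the expansion step (expand, childCount, and sha1 are hypothetical helpers, not the benchmark's actual names):

    // Expanding a node only needs its id: childCount applies the law f,
    // and each child id is derived by hashing the parent id and child index.
    def expand(nodeId:Rail[Byte], enqueue:(Rail[Byte])=>void) {
        val count = childCount(nodeId);   // f(nodeId)
        for (i in 0..(count-1)) {
            enqueue(sha1(nodeId, i));     // childId = SHA1(nodeId, childIndex)
        }
    }

This separability is exactly what makes relocation free: shipping a node id ships the whole subtree rooted at it.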
Scalable Global Load Balancing: Unbalanced Tree Search

§ Lifeline-based global work stealing [PPoPP'11]
  § n random victims, then p lifelines (hypercube)
    § fixed graph with low degree and low diameter
  § synchronous (steal) then asynchronous (deal)
§ Root finish accounts for
  § startup asyncs + lifeline asyncs
  § not random steal attempts
§ Compact work queue (for shallow trees)
  § represents intervals of siblings
  § thief steals half of each work item
§ Sparse communication graph
  § bounded list of potential random victims
  § finish trades contention for latency

→ genuine APGAS algorithm
UTS Main Loop

    def process() {
        alive = true;
        while (!empty()) {
            while (!empty()) {
                processAtMostN();
                Runtime.probe();
                deal();
            }
            steal();
        }
        alive = false;
    }

    def steal() {
        val h = here.id;
        for (i:Int in 0..w) {
            if (!empty()) break;
            finish at (Place(victims(rnd.nextInt(m)))) async request(h, false);
        }
        for (lifeline:Int in lifelines) {
            if (!empty()) break;
            if (!lifelinesActivated(lifeline)) {
                lifelinesActivated(lifeline) = true;
                finish at (Place(lifeline)) async request(h, true);
            }
        }
    }
UTS Handling Thieves

    def request(thief:Int, lifeline:Boolean) {
        val nodes = take(); // grab nodes from the local queue
        if (nodes == null) {
            if (lifeline) lifelineThieves.push(thief);
            return;
        }
        at (Place(thief)) async {
            if (lifeline) lifelinesActivated(thief) = false;
            enqueue(nodes); // add nodes to the local queue
        }
    }

    def deal() {
        while (!lifelineThieves.empty()) {
            val nodes = take(); // grab nodes from the local queue
            if (nodes == null) return;
            val thief = lifelineThieves.pop();
            at (Place(thief)) async {
                lifelinesActivated(thief) = false;
                enqueue(nodes); // add nodes to the local queue
                if (!alive) process();
            }
        }
    }
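Note the division of labor: a failed steal returns quickly (the finish at the victim only records the thief on lifelineThieves), while deal() later pushes work back to registered lifeline thieves and, via if (!alive) process(), restarts the processing loop of a place that had gone idle.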
From Theory to Practice

From 256 cores in January 2011, to 7,936 in March 2012, to 47,040 in July 2012. Delivered in August 2012.

→ good abstractions
Scale Up
Local Load Balancing

§ Problem: many more asyncs than execution units (hardware threads)
§ Solution: cooperative work-stealing scheduling (see the sketch below)
  § pool of worker threads
  § per-worker deque (double-ended queue) of pending jobs
  § an idle worker pops a job from its own deque, or steals from another worker if its deque is empty
§ Question 1: what is a job? Given async p; q…
  § obvious candidate: the async body p (→ fork/join-style scheduler)
  § alternative: the continuation q… (→ Cilk-style scheduler)
§ Question 2: what if a job has to wait (IO, RPC…)?
  § if possible: execute a prerequisite job while waiting
  § if not, either wait (but spawn a new thread to maintain core utilization)
  § or save the execution state and reuse the thread
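A deliberately simplified, single-threaded sketch of this discipline (Worker and its toy fixed-capacity deque are invented for illustration; the real scheduler synchronizes deque access):

    class Worker {
        val jobs = new Rail[()=>void](1024); // toy fixed-capacity deque
        var top:Int = 0;                     // thieves steal at the top (oldest jobs)
        var bottom:Int = 0;                  // the owner pushes/pops at the bottom (newest)
        def push(job:()=>void) { jobs(bottom++) = job; }
        def popBottom():()=>void {
            if (bottom == top) return null;
            return jobs(--bottom);
        }
        def stealTop():()=>void {
            if (bottom == top) return null;
            return jobs(top++);
        }
        def loop(victim:Worker) {
            while (true) {
                var job:()=>void = popBottom();           // own deque first
                if (job == null) job = victim.stealTop(); // idle: steal the oldest job
                if (job == null) break;                   // no work anywhere
                job();                                    // run the job
            }
        }
    }

Owners working LIFO preserves locality, while thieves taking FIFO tend to grab the largest pending subcomputations.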
Fibonacci

    static def fib(n:Int) {
        if (n <= 1) return n;

        val x:Int;
        val y:Int;

        finish {
            async x = fib(n-2);
            y = fib(n-1);
        }
        return x + y;
    }
[Diagram: worker deques while computing fib(10): frames for fib(10), fib(9), fib(8), … interleaved with finish and async markers, showing the continuations available for stealing.]
Cilk-Style Scheduling for X10

§ X10-source-to-X10-source program transformation
  § exposes the "compiler stack"
  § transforms methods into frame classes (illustrated below)
  § transforms local variables into frame fields
  § rewrites method invocations to save the parent frame pointer and pc
  § rewrites methods to permit reentry (resume execution at a specified instruction)
  § rewrites X10 concurrency constructs into invocations of the runtime scheduler
§ X10 runtime scheduler in X10
  § can walk the stack, save the stack, restore the stack
  § can resume execution at a saved program point with the restored stack
§ Hacks
  § permit uninitialized fields
  § speculative stack allocation
Micro Benchmarks
[Charts: speedup on 1 to 16 workers for Cilk++, Java fork/join (JavaFJ), and X10 work stealing (X10WS) on Fib, Integrate, and QuickSort; execution time in seconds with 16 threads and sequential overhead for Fib(40), Integrate(1536), and QuickSort(10M).]
X10 vs. Cilk++: PBBS Benchmarks
http://www.cs.cmu.edu/~guyb/pbbs/

§ Compared running time (s) with 16 cores

[Chart: running time of Cilk++ vs. X10WS with 16 cores on the PBBS benchmarks (MI, MM.rp, MM.mm, MM.mm2d, CG, BFS.nd, BFS.d, the RS.* variants, and the SQ.* variants).]
X10 vs. Cilk++: PBBS Benchmarks
http://www.cs.cmu.edu/~guyb/pbbs/
§ Compared sequential overhead
[Chart: sequential overhead of Cilk++ vs. X10WS on the PBBS SQ.* benchmark variants.]
Wrap-Up
Conclusions

§ X10 can scale to petaflop systems
§ X10 can ease the development of scalable codes
  § for traditional statically scheduled and distributed codes
  § for novel classes of codes (APGAS)

Future Work
§ Tightly integrate the scale-out and scale-up work
§ Continue and generalize the work on distributed termination
§ Systematize distributed load balancing
§ Develop APGAS beyond X10
  § X10-runtime-as-a-library project
References
§ Main X10 website: http://x10-lang.org
§ "A Brief Introduction to X10 (for the HPC Programmer)": http://x10.sourceforge.net/documentation/intro/intro-223.pdf
§ X10 2012 HPC Challenge submission: http://hpcchallenge.org, http://x10.sourceforge.net/documentation/hpcc/x10-hpcc2012-paper.pdf
§ Unbalanced Tree Search in X10 [PPoPP'11]: http://dl.acm.org/citation.cfm?id=1941582
§ Cilk-style scheduling for X10: http://dl.acm.org/citation.cfm?id=2145850