A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches

Martin Burtscher¹ and Hassan Rabeti²

¹Department of Computer Science, Texas State University-San Marcos
²Department of Mathematics, Texas State University-San Marcos


Page 1: A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches


Page 2

Problem: HPC is Hard to Exploit

HPC application writers are domain experts
- They are typically not computer scientists and have little or no formal education in parallel programming
- Parallel programming is difficult and error prone

Modern HPC systems are complex
- They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
- They require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance

Page 3

Target Area: Iterative Local Searches

Important application domain
- Widely used in engineering & real-time environments
- Examples: all sorts of random-restart greedy algorithms (ant colony optimization, Monte Carlo, n-opt hill climbing, etc.)

ILS properties
- Iteratively produce better solutions
- Can exploit large amounts of parallelism
- Often have an exponential search space

Page 4

Our Solution: ILCS Framework

Iterative Local Champion Search (ILCS) framework
- Supports non-random restart heuristics: genetic algorithms, tabu search, particle swarm optimization, etc.
- Simplifies implementation of ILS on parallel systems

Design goal
- Ease of use and scalability

Framework benefits
- Handles threading, communication, locking, resource allocation, heterogeneity, load balance, termination decision, and result recording (checkpointing)

Page 5

User Interface

The user writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions:

size_t CPU_Init(int argc, char *argv[]);
void CPU_Exec(long seed, void const *champion, void *result);
void CPU_Output(void const *champion);

See the paper for the GPU interface and sample code. The framework runs the Exec (map) functions in parallel.
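To show how the three callbacks fit together, here is a hedged sketch that fills them in for the same toy objective (minimize (x - 42)^2). Only the three signatures come from the slide; the objective, the state layout, and how ILCS compares candidate results are assumptions.

```c
/* Illustrative ILCS user code: the champion/result state is a single long. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long f(long x) { return (x - 42) * (x - 42); }   /* toy objective */

/* Return the size of the solution ("champion") state in bytes. */
size_t CPU_Init(int argc, char *argv[]) {
    (void)argc; (void)argv;         /* a real code would parse its input here */
    return sizeof(long);
}

/* Evaluate one seed: derive a candidate from the seed, locally improve it,
   and write it to *result; the framework keeps the best across all seeds. */
void CPU_Exec(long seed, void const *champion, void *result) {
    (void)champion;                 /* non-random-restart heuristics may read it */
    long x = seed % 1000;           /* seed determines the starting point */
    while (f(x - 1) < f(x)) x--;    /* greedy descent */
    while (f(x + 1) < f(x)) x++;
    memcpy(result, &x, sizeof(long));
}

/* Print the globally best solution at the end. */
void CPU_Output(void const *champion) {
    long x;
    memcpy(&x, champion, sizeof(long));
    printf("best x = %ld, f(x) = %ld\n", x, f(x));
}
```

Note how the user code contains no threading, MPI, or GPU logic; the seed argument is the only coupling to the framework's work distribution.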

Page 6

Internal Operation: Threading

[Diagram: per node, a master/communication thread (Fm) runs alongside one CPU worker thread per core, each repeatedly executing the user's CPU code (Fc), and one handler thread per GPU (Fg, h) that launches the user's GPU code on that GPU's worker threads]

- ILCS master thread starts
- Master forks a worker per core
- Master forks a handler per GPU
- Workers evaluate seeds, record local optimum
- GPU workers evaluate seeds, record local optimum
- Handlers launch GPU code, sleep, record result
- Master sporadically finds global optimum via MPI, sleeps

Page 7

Internal Operation: Seed Distribution

E.g., 4 nodes w/ 4 cores (a, b, c, d) and 2 GPUs (1, 2)

[Diagram: the 64-bit seed range 0, 1, 2, …, 2^64-2, 2^64-1 is split into equal chunks (Node 0 at the bottom through Node 3, with boundaries at 2^62, 2^63, etc.); within each chunk, the CPU threads (a, b, c, d) take one seed at a time from the bottom while the GPUs (1, 2) take strided ranges of seeds from the top]

Benefits
- Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
- Users can generate other distributions from the seeds
- Any injective mapping results in no redundant evaluations

- Each node gets a chunk of the 64-bit seed range
- CPUs process the chunk bottom up
- GPUs process the chunk top down

Page 8

Related Work

MapReduce/Hadoop/MARS and PADO
- Their generality and their features that are unnecessary for ILS incur overhead and increase the learning curve
- Some do not support accelerators; some require Java

ILCS framework is optimized for ILS applications
- Reduction is provided, does not require multiple keys, does not need secondary storage to buffer data, directly supports non-random restart heuristics, allows early termination, works with GPUs and MICs, and targets everything from single-node workstations to HPC clusters

Page 9

Evaluation Methodology

Three HPC systems (at TACC and NICS):

  system      compute    CPUs     CPU cores   CPU clock    GPUs   GPU cores   GPU clock
              nodes                           frequency                       frequency
  Keeneland       264      528        4,224     2.6 GHz     792     405,504     1.3 GHz
  Ranger        3,936   15,744       62,976     2.3 GHz       -           -           -
  Stampede      6,400   12,800      102,400     2.7 GHz    128*         n/a         n/a

Largest tested configuration:

  system      compute    total    total    total        total
              nodes      CPUs     GPUs     CPU cores    GPU cores
  Keeneland       128      256      384        2,048      196,608
  Ranger        2,048    8,192        0       32,768            0
  Stampede      1,024    2,048        0       16,384            0

Page 10

Sample ILS Codes

Traveling Salesman Problem (TSP)
- Find shortest tour
- 4 inputs from TSPLIB
- 2-opt hill climbing

Finite State Machine (FSM)
- Find best FSM configuration to predict hit/miss events
- 4 sizes (n = 3, 4, 5, 6)
- Monte Carlo method

[Figure: FSM transition table; the n-bit current state and the input bit index a table of 2^(n+1) entries, each giving the next state]

Page 11

FSM Transitions/Second Evaluated

[Chart: transitions evaluated per second (trillions, 0 to 25) for the 3-bit to 6-bit FSMs on Keeneland, Ranger, and Stampede. Annotations: peak of 21,532,197,798,304 s⁻¹; GPU shared-memory limit; Ranger uses twice as many cores as Stampede.]

Page 12

TSP Tour-Changes/Second Evaluated

[Chart: moves evaluated per second (trillions, 0 to 14) for kroE100, ts225, rat575, and d1291 on Keeneland, Ranger, and Stampede. Annotations: peak of 12,239,050,704,370 s⁻¹, based on serial CPU code; the CPU pre-computes distances using O(n²) memory while the GPU re-computes them with O(n) memory; each core evaluates a tour change every 3.6 cycles.]

Page 13

TSP Moves/Second/Node Evaluated

[Chart: moves evaluated per second per node (billions, 0 to 50) for kroE100, ts225, rat575, and d1291 on Keeneland, Ranger, and Stampede. Annotation: GPUs provide >90% of the performance on Keeneland.]

Page 14

ILCS Scaling on Ranger (FSM)

[Chart: transitions evaluated per second (billions, log scale from 1 to 10,000) vs. compute nodes for the 3-bit to 6-bit FSMs. Annotations: >99% parallel efficiency on 2048 nodes; the other two systems are similar.]

Page 15

ILCS Scaling on Ranger (TSP)

[Chart: moves evaluated per second (billions, log scale from 0.1 to 100,000) vs. compute nodes for kroE100, ts225, rat575, and d1291. Annotations: >95% parallel efficiency on 2048 nodes; longer runs are even better.]

Page 16

Intra-Node Scaling on Stampede (TSP)

[Chart: moves evaluated per second (billions, 0 to 14) vs. 1-16 worker threads for kroE100, ts225, rat575, and d1291. Annotations: >98.9% parallel efficiency on 16 threads; framework overhead is very small.]

Page 17

Tour Quality Evolution (Keeneland)

[Chart: deviation from optimal tour length (0% to 8%) vs. step 1-29 for kroE100, ts225, rat575, and d1291. Annotation: quality depends on chance; ILS provides a good solution quickly, then progressively improves it.]

Page 18

Tour Quality after 6 Steps (Stampede)

[Chart: deviation from optimal tour length (0% to 9%) vs. compute nodes (1 to 1024) for kroE100, ts225, rat575, and d1291. Annotation: larger node counts typically yield better results faster.]

Page 19

Summary and Conclusions

ILCS Framework
- Automatic parallelization of iterative local searches
- Provides MPI, OpenMP, and multi-GPU support
- Checkpoints the current best solution every few seconds
- Scales very well (decentralized)

Evaluation
- 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
- AMD + Intel CPUs, NVIDIA GPUs, and Intel MICs

ILCS source code is freely available:
http://cs.txstate.edu/~burtscher/research/ILCS/

Work supported by NSF, NVIDIA, and Intel; resources provided by TACC and NICS