NSF/DARPA OPAAL: Adaptive Parallelization Strategies using Data-driven Objects
Laxmikant Kale, First Annual Review, 27-28 October 1999, Iowa City


TRANSCRIPT

Page 1: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

CPSD

NSF/DARPA OPAAL
Adaptive Parallelization Strategies using Data-driven Objects

Laxmikant Kale

First Annual Review

27-28 October 1999, Iowa City

Page 2: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Outline

Quench and solidification codes
Coarse grain parallelization of the quench code
Adaptive parallelization techniques
Dynamic variations
Adaptive load balancing
Finite element framework with adaptivity
Preliminary results

Page 3: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Coarse grain parallelization

Structure of the current sequential quench code:
2-D array of elements (each independently refined)
Within-row dependence
Independent rows, but…
—share global variables

Parallelization using Charm++:
3 hours of effort (after a false start)
About 20 lines of change to the F90 code
A 100-line Charm++ wrapper

Observations:
—Global variables that are defined and used within inner loop iterations are easily dealt with in Charm++, in contrast to OpenMP
—Dynamic load balancing is possible, but was unnecessary
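The row-independence described above can be sketched in plain C++. This is a hypothetical stand-in for the Charm++ wrapper, not the actual quench code: each row object carries a private copy of what the sequential F90 code kept global, so rows can proceed independently.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only: each "row object" privately holds the
// variable that was a shared global in the sequential code, so the
// rows no longer interfere when processed independently.
struct RowObject {
    double scratch;            // was a global in the sequential code
    std::vector<double> row;

    double process() {
        scratch = 0.0;         // defined and used within this object only
        for (double v : row) scratch += v;
        return scratch;
    }
};

// Process all rows; with per-object state, iteration order is irrelevant.
double run_rows(std::vector<RowObject>& rows) {
    double total = 0.0;
    for (auto& r : rows) total += r.process();
    return total;
}
```

Because the per-iteration globals become per-object members, no privatization directives are needed, which is the contrast with OpenMP noted above.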

Page 4: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Performance results

[Chart: Speedup for Micro1D; speedup (0-70) vs. number of processors (0-70)]

Contributors:

Engineering: N. Sobh, R. Haber

Computer Science: M. Bhandarkar, R. Liu, L. Kale

Page 5: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

OpenMP experience

Work by: J. Hoeflinger, D. Padua, with N. Sobh, R. Haber, J. Dantzig, N. Provatas

Solidification code:
Parallelized using OpenMP
Relatively straightforward, after a key decision
—Parallelize by rows only

Page 6: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

OpenMP experience: Quench code on Origin2000

Privatization of variables is needed
—as the outer loop was parallelized
Unexpected initial difficulties with OpenMP
—Led initially to a large slowdown in the parallelized code
—Traced to unnecessary locking in the MATMUL intrinsic

[Chart: execution time in seconds (0-400) vs. number of processors (0-9)]

Page 7: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Adaptive Strategies

Advanced codes model dynamic and irregular behavior
Solidification: adaptive grid refinement
Quench:
—Complex dependencies
—Parallelization within elements
To parallelize these effectively,
—adaptive runtime strategies are necessary

Page 8: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Multi-partition decomposition

Idea: decompose the problem into a number of partitions, independent of the number of processors
# Partitions > # Processors
The system maps partitions to processors
The system should be able to map and re-map objects as needed
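A minimal sketch of the mapping idea, assuming a simple round-robin initial placement. The function name and strategy are illustrative, not the Charm++ runtime's; the point is that remapping is just rewriting the partition-to-processor table.

```cpp
#include <cassert>
#include <vector>

// Sketch: map nparts partitions onto nprocs processors, nparts > nprocs.
// The runtime owns this table and may rewrite it to re-map objects.
std::vector<int> map_partitions(int nparts, int nprocs) {
    std::vector<int> owner(nparts);
    for (int p = 0; p < nparts; ++p)
        owner[p] = p % nprocs;   // round-robin initial mapping (assumed)
    return owner;
}
```

With, say, 8 partitions on 3 processors, the table assigns partitions 0, 3, 6 to processor 0, and so on; migrating a partition means changing one table entry and moving its data.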

Page 9: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Charm++

A parallel C++ library
Supports data-driven objects
—singleton objects, object arrays, groups, …
Many objects per processor, with method execution scheduled with availability of data
System supports automatic instrumentation and object migration
Works with other paradigms: MPI, OpenMP, …

Page 10: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Data-driven execution in Charm++

[Diagram: per-processor schedulers, each picking work from a message queue]
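The execution model in the diagram can be illustrated with a toy scheduler. This is a hypothetical simplification, not the Charm++/Converse scheduler: each processor holds a message queue, and an object's method runs only when its message (data) is available.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Toy data-driven scheduler: messages carry the method invocation to
// perform; the scheduler drains the queue, running each as its data
// becomes available. (Illustrative sketch, not the Charm++ scheduler.)
struct Scheduler {
    std::queue<std::function<void()>> messageQ;

    void send(std::function<void()> m) { messageQ.push(std::move(m)); }

    void run() {
        while (!messageQ.empty()) {
            auto m = std::move(messageQ.front());
            messageQ.pop();
            m();    // invoke the object method bound to this message
        }
    }
};
```

Because many objects share one queue per processor, the scheduler can interleave their methods freely, which is what makes migration and load balancing transparent to the application.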

Page 11: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Load Balancing Framework

Aimed at handling …
Continuous (slow) load variation
Abrupt load variation (refinement)
Workstation clusters in multi-user mode

Measurement based
Exploits temporal persistence of computation and communication structures
Very accurate (compared with estimation) instrumentation possible via Charm++/Converse
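One common measurement-based strategy, shown here as a sketch, is greedy reassignment: measured per-object loads (temporal persistence means past load predicts future load) are assigned, heaviest first, to the least-loaded processor. This is an assumed illustration, not the framework's actual strategy code.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Greedy rebalancing sketch: sort objects by measured load (descending),
// then place each on the currently least-loaded processor.
std::vector<int> greedy_assign(const std::vector<double>& load, int nprocs) {
    std::vector<int> order(load.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] > load[b]; });

    std::vector<double> procLoad(nprocs, 0.0);
    std::vector<int> owner(load.size());
    for (int obj : order) {
        int p = (int)(std::min_element(procLoad.begin(), procLoad.end())
                      - procLoad.begin());
        owner[obj] = p;
        procLoad[p] += load[obj];
    }
    return owner;
}
```

The same routine handles both slow drift and abrupt refinement: only the measured loads change between invocations.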

Page 12: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Object balancing framework

Page 13: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Utility of the framework: workstation clusters

Cluster of 8 machines; one machine gets another job
Parallel job slows down on all machines

Using the framework:
Detection mechanism
Migrate objects away from the overloaded processor
Restored almost original throughput!

Page 14: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Performance on timeshared clusters

Another user logged on at about 28 seconds into a parallel run on 8 workstations. Throughput dipped from 10 steps per second to 7. The load balancer intervened at 35 seconds, and restored throughput to almost its initial value.

Page 15: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Utility of the framework: Intrinsic load imbalance

To test the abilities of the framework:
A simple problem: Gauss-Jacobi iterations
Refine selected sub-domains
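A minimal 1-D Jacobi relaxation in the spirit of the synthetic benchmark (an assumed form of the test problem, not the project's code): each sweep replaces interior points with the average of their neighbors, and refining a sub-domain adds points, and hence load, in that region.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One Jacobi sweep on a 1-D grid; boundary values stay fixed.
// (Assumed form of the synthetic relaxation benchmark.)
std::vector<double> jacobi_sweep(const std::vector<double>& u) {
    std::vector<double> v = u;
    for (size_t i = 1; i + 1 < u.size(); ++i)
        v[i] = 0.5 * (u[i - 1] + u[i + 1]);
    return v;
}
```

Repeated sweeps converge to the linear interpolant between the boundary values; decomposing the grid into many such sub-domains gives the object collection the load balancer then redistributes.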

ConSpector: web-based tool
Submit parallel jobs
Monitor performance and application behavior
Interact with running jobs via GUI interfaces

Page 16: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

AppSpector view of the load balancer on the synthetic Jacobi relaxation benchmark.

Imbalance is introduced by interactively refining a subset of cells around 9 seconds. The resultant load imbalance brings the utilization down to 80% from the peak of 96%. The load balancer kicks in around t = 16, and restores utilization to around 94%.

Page 17: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Using the Load Balancing Framework

[Diagram: Charm++ and Converse layers over the load database + balancer; an MPI-on-Charm path (Irecv+, automatic conversion from MPI) and an FEM/Structured framework path, with cross-module interpolation and a migration path]

Page 18: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Example application

Crack propagation (P. Geubelle et al.)
Similar in structure to Quench components
1900 lines of F90

Rewritten using the FEM framework in C++:
1200 lines of C++ code
Framework: 500 lines of code,
—reused by all applications
Parallelization completely by the framework

Page 19: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Crack Propagation

Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle

Page 20: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

“Overhead” of multi-partition method

[Chart: execution time (0-40) vs. number of partitions (1-1000, log scale)]

Page 21: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Overhead study on 8 processors

[Chart: execution time on 8 processors (0-12 seconds) vs. number of chunks per processor (1-100, log scale)]

When running on 8 processors, the effect of using multiple partitions per processor is also beneficial, due to cache behavior.

Page 22: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Cross-approach comparison

[Chart: performance comparison across approaches; execution time in seconds (0-4) vs. number of partitions (1-128) for three versions: MPI-F90 original, Charm++ framework (all C++), and F90 + Charm++ library]

Page 23: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Load balancer in action

Page 24: NSF/DARPA OPAAL Adaptive Parallelization Strategies  using Data-driven Objects

Summary and Planned Research

Use the adaptive FEM framework:
To parallelize the Quench code further
Quad-tree based solidification code:
—First phase: parallelize each phase separately
—Parallelize across refinement phases

Refine the FEM framework:
Use feedback from applications
Support for implicit solvers and multigrid