
J. Parallel Distrib. Comput. 60 (2005) 519–531
www.elsevier.com/locate/jpdc

Towards autonomic application-sensitive partitioning for SAMR applications

Sumir Chandra∗, Manish Parashar
The Applied Software Systems Laboratory, Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey,

94 Brett Road, Piscataway, NJ 08854 USA

Received 21 August 2003; received in revised form 14 June 2004; accepted 14 September 2004

Abstract

Distributed structured adaptive mesh refinement (SAMR) techniques offer the potential for accurate and cost-effective solutions of physically realistic models of complex physical phenomena. However, the heterogeneous and dynamic nature of SAMR applications results in significant runtime management challenges. This paper investigates autonomic application-sensitive SAMR runtime management strategies and presents the design, implementation, and evaluation of ARMaDA, a self-adapting and optimizing partitioning framework for SAMR applications. ARMaDA monitors and characterizes application runtime state, and dynamically selects and invokes appropriate partitioning mechanisms that match current SAMR state and optimize its computational and communication performance. The advantages of the autonomic partitioning capabilities provided by ARMaDA are experimentally demonstrated.
© 2005 Elsevier Inc. All rights reserved.

Keywords: Autonomic partitioning; Dynamic applications; Runtime application characterization; Structured adaptive mesh refinement

1. Introduction

Dynamic structured adaptive mesh refinement (SAMR) [3] techniques for the solutions of partial differential equations enable computational effort and resources to be dynamically concentrated on appropriate regions of the computational domain by overlaying uniform patch-based refinements on these regions. These techniques can yield highly advantageous cost/accuracy ratios as compared to techniques based on static approximations. SAMR techniques are being effectively used in several application domains including computational fluid dynamics [1,2,17], numerical relativity [8,10], astrophysics [4,12], subsurface modeling and oil reservoir simulation [16,23].

The research presented in this paper was supported in part by the National Science Foundation via grant numbers ACI 9984357, EIA 0103674, EIA-0120934, ANI 0335244, CNS 0426354 and IIS 0430826, and by DOE ASCI/ASAP via grant numbers PC295251 and 82-1052856 awarded to Manish Parashar.

∗ Corresponding author. Fax: +1 732 445 0593.
E-mail addresses: [email protected] (S. Chandra), [email protected] (M. Parashar).

0743-7315/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2004.09.020

SAMR-based realizations of complex applications can provide dramatic insights into physical phenomena such as interacting black holes and neutron stars, formations of galaxies, subsurface flows in oil reservoirs and aquifers, and dynamic response of materials to detonation. However, the phenomena being modeled by these applications and their implementations are inherently multi-phased, dynamic, and heterogeneous in time, space, and state. Emerging large-scale parallel/distributed computing environments are similarly complex, heterogeneous, and dynamic. As a result, scalable implementations of SAMR applications in distributed environments present significant challenges in runtime management, adaptation, and optimization. In this paper, we investigate autonomic¹ application-sensitive approaches for SAMR runtime management [7], enabling the efficient execution of SAMR applications.

¹ Autonomic computing is based on the strategies used by biological systems to deal with complexity, heterogeneity, and uncertainty, and enables applications that are self-configuring, optimizing, self-healing, contextually aware, and open.

This paper presents the design, implementation, and evaluation of ARMaDA, a self-adapting and optimizing partitioning framework for SAMR applications. The overall goal of ARMaDA is to manage dynamism and space-time heterogeneity, and support the efficient and scalable execution of SAMR implementations. It monitors and characterizes the runtime state of SAMR applications. It then selects, configures, and invokes appropriate partitioning and scheduling mechanisms to match the current state of the application and optimize its computational and communication performance. ARMaDA is composed of three key components: an application state monitoring and characterization component, a partitioner selection component and policy engine, and a runtime adaptation component. The partitioning strategies used include a selection from broadly used software tools such as GrACE [14,15] and Vampire [21]. The advantages of the autonomic partitioning capabilities provided by ARMaDA are experimentally demonstrated using a selection of SAMR kernels.

ARMaDA builds on our previous work where we defined an approach for characterizing the state of SAMR applications [6,22] and used this approach to experimentally study the behavior of structured domain-based partitioners. In this paper, we use these building blocks to construct an autonomic partitioning framework. This paper makes two key contributions: (1) it defines mechanisms for monitoring, analyzing, and characterizing the state of distributed SAMR applications at runtime, and (2) it defines policies for autonomic application-sensitive partitioner selection, configuration, and invocation to optimize the overall performance of SAMR applications.

The rest of the paper is organized as follows. Section 2 discusses related research and prior work. Section 3 presents the building blocks for constructing an autonomic partitioner for SAMR applications. Section 4 uses a "feasibility study" example to illustrate the operation of adaptive application-sensitive partitioning and to demonstrate its advantages. Section 5 presents the design and implementation of the ARMaDA framework. Section 6 describes the experimental evaluation of ARMaDA using a selection of SAMR kernels. Section 7 presents concluding remarks and future work.

2. Related work

Related research in adaptive partitioning and dynamic load balancing for adaptive mesh refinement (AMR) applications is summarized in Table 1. Partitioners for AMR applications not only have to balance load but also have to address locality, available parallelism, communication overheads, data-migration overheads, partitioning quality, and repartitioning overheads. Note that these factors often have conflicting requirements and must be considered together.

Table 1
Summary of related research in dynamic partitioning and load balancing for AMR applications

Scheme (Type): Description
PLUM (Unstructured): Dynamic load balancing for unstructured meshes using a cost-metric model with computation, communication, and remapping weights; accepts a new partitioning if the gain is greater than the redistribution cost.
Unified Repartitioning Algorithm (Unstructured): Repartitioning for unstructured meshes based on a relative cost factor that minimizes both communication and data redistribution costs; the initial partitioning scheme yielding the lowest cost is selected.
SAMR Dynamic Load Balancing (Structured): Defines a moving-grid phase that balances load across processors after adaptation, and a splitting-grid phase that splits a grid along the longest dimension, if required.
Partitioning Advisory System (Unstructured): Ranks partitioning methods based on problem space, machine speed, and network information; considers processor performance variance assuming linear complexity.

Existing approaches typically consider only a subset of these factors. For example, certain techniques (such as SAMR DLB) only consider the computational workload while some others (such as PLUM and URA) combine the load with communication and locality.

Note that SAMR is an alternative to the unstructured approach and partitioners for unstructured AMR cannot typically be used for SAMR. Though there exist several adaptive partitioning approaches for unstructured AMR, there are only a few research efforts investigating runtime adaptations for SAMR. In this paper, we focus on the partitioners that have been shown to be most appropriate for SAMR [22], i.e., those based on space-filling curves. However, related work for "unstructured methods" can provide useful insights into several runtime adaptation concepts, such as the metric and policy analysis and execution, while developing similar approaches for SAMR.

The research presented in this paper proposes an autonomic partitioning solution for self-adapting and optimizing SAMR application management. Our approach defines a multi-component objective function that addresses load balance, communication overheads, locality, data-movement overheads, partitioning quality, and repartitioning overheads. Furthermore, the adaptation/optimization process is based on the current state of the application and consists of dynamically selecting the most appropriate partitioning algorithm and configuration that addresses the requirements of the current SAMR grid hierarchy to improve overall application performance.

2.1. PLUM

PLUM (Parallel Load balancing for adaptive Unstructured Meshes) [13] is a dynamic load balancing strategy for adaptive unstructured grid computations that uses a cost-metric model for efficient mesh adaptation. In addition to a flow solver and a mesh adaptor, PLUM includes a partitioner and a remapper that load balances and redistributes the computational mesh, when necessary. Parallel adaptation consists of three phases—initialization, execution, and finalization. A dual graph representation of the initial computational mesh keeps complexity and connectivity constant.

The PLUM model uses computation (Wcomp), communication (Wcomm), and data-remapping (Wremap) weights to implement metrics that estimate and compare the computational gain and the redistribution cost due to load balancing after each mesh adaptation step. The new partitioning and mapping are accepted if the computational gain is larger than the redistribution cost. The metrics used by PLUM are similar to the metrics proposed in this paper. However, we also consider the speed of the partitioner, the quality of the partition, and partitioning overheads, which are not considered by PLUM. Furthermore, in addition to deciding whether to repartition or not, we also dynamically select and configure the appropriate partitioning algorithm.

2.2. Unified repartitioning algorithm

The Unified Repartitioning Algorithm (URA) [19] is a parallel adaptive repartitioning scheme for the dynamic load balancing of scientific simulations. A key parameter used in URA is the Relative Cost Factor that describes the relative times required for inter-processor communication and data redistribution during load balancing. Using this parameter, it is possible to combine the two minimization objectives (namely, inter-processor communication and data redistribution) of the adaptive graph partitioning problem into a precise cost function. URA attempts to compute a repartitioning while directly minimizing this cost function. URA has three phases: graph coarsening, initial partitioning, and uncoarsening/refinement. In the initial partitioning phase, the coarsest graph is repartitioned twice using alternative methods, which are optimized variants of the scratch-remap and global diffusion schemes. The cost functions are then computed for each of these schemes and the one with the lowest cost is selected. Note that, unlike our approach, URA actually repartitions using each of the two available schemes and then chooses the one yielding the lowest cost.

2.3. Dynamic load balancing for SAMR

In investigating dynamic load balancing (DLB) schemes for SAMR [11], Lan et al. analyzed application requirements, identified four unique characteristics (viz., coarse granularity, high dynamism, high imbalance, and different dispersion), and provided an implementation that maintains global information on each processor to manage the dynamic grid hierarchy. Their DLB scheme is composed of two phases: a moving-grid phase and a splitting-grid phase. During the "moving-grid phase", after each adaptation, the DLB is triggered if load is not balanced, i.e., if MaxLoad/AvgLoad > threshold. MaxProc moves its grid directly to MinProc under the condition that the computational load of this grid does not exceed (threshold ∗ AvgLoad − MinLoad), thereby ensuring that an under-loaded processor does not become overloaded. If imbalance still exists, the "splitting-grid phase" is invoked, which splits the grid along the longest dimension into two sub-grids. Eventually, either the load is balanced or the minimum allowable grid size is reached. Both the moving-grid phase and the splitting-grid phase may execute in parallel. Note that while this scheme does address SAMR, it only addresses dynamic load balancing and not adaptive partitioning.
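The trigger and transfer conditions described above can be expressed compactly; the following is a minimal sketch (hypothetical helper functions, not code from Lan et al.'s implementation):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical illustration of the moving-grid trigger described above: rebalancing
// starts when MaxLoad/AvgLoad exceeds a threshold, and a grid may move from the most
// loaded processor (MaxProc) to the least loaded one (MinProc) only if its load does
// not exceed (threshold * AvgLoad - MinLoad).
bool shouldTriggerMovingGrid(const std::vector<double>& procLoads, double threshold) {
    const double maxLoad = *std::max_element(procLoads.begin(), procLoads.end());
    const double avgLoad =
        std::accumulate(procLoads.begin(), procLoads.end(), 0.0) / procLoads.size();
    return maxLoad / avgLoad > threshold;
}

bool canMoveGrid(double gridLoad, double avgLoad, double minLoad, double threshold) {
    // Ensures the receiving (under-loaded) processor does not become overloaded.
    return gridLoad <= threshold * avgLoad - minLoad;
}
```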

2.4. Partitioning advisory system

Crandall and Quinn have developed a partitioning advisory system [9] for networks of workstations. The advisory system has three built-in partitioning methods, viz., contiguous row, contiguous point, and block. Given information about the problem space, the machine speed, and the network, the advisory system provides a ranking of the three partitioning methods. The advisory system takes into consideration the variance in processor performance among the workstations, but variance in network performance is not considered. The problem, however, is that linear computational complexity is assumed for the application. Moreover, this work does not address AMR applications.

3. Enabling autonomic partitioning for SAMR applications

A number of factors contribute to the overall performance of SAMR applications. The relationships between the contributing factors are often complex and depend on dynamic information such as the current state of the application and the system. Dynamic partitioning and load balancing for SAMR applications involves distributing application components on available resources to address parallelism, locality, and current computational, communication, and storage requirements. Partitioners typically optimize a subset of these requirements at the expense of others. Hence, the most appropriate partitioner and associated partitioning parameters depend on the application's configuration and its runtime state. Runtime partitioner selection, configuration, and optimization require dynamically identifying the most appropriate combination of partitioner and parameters that match the current state of the application/system, and can be a significant challenge. This motivates the investigation of autonomic, self-adapting, and optimizing partitioning and runtime management solutions for SAMR applications. Such a solution consists of three steps:

Monitoring and characterization: The application state is monitored at regular intervals and application behavior is abstracted in terms of a partitioning quality metric that combines computation/communication requirements, application dynamics, and the nature of adaptation, as defined in the following section.


[Fig. 1 depicts a cube divided into octants I–VIII along three axes: computation-dominated vs. communication-dominated runtime, more localized vs. more scattered adaptation, and lower vs. higher activity dynamics.]

Fig. 1. The octant approach [22] for characterizing application state based on its adaptation pattern, computation/communication requirements, and activity dynamics.

Analysis and deduction: Using the abstracted application state and the defined rules, constraints, and heuristics, the most appropriate partitioning algorithm and associated partitioning parameters (e.g., granularity) are selected at runtime from a family of available partitioners.

Adaptation and optimization: The selected partitioner is dynamically configured and invoked. Optimizations are performed to reduce the overheads due to monitoring and characterization and the dynamic switching of partitioners.

The autonomic partitioning approach outlined above builds on our prior work on application-sensitive characterization of domain-based partitioners [6,22] and application-sensitive partitioning [5] for SAMR applications. In this section, we briefly introduce the key building blocks for enabling autonomic SAMR partitioning, including the partitioning quality metric, the characterization of domain-based partitioners, the octant approach for classifying application state, and the policies for mapping partitioners to application state.

3.1. Characterizing SAMR runtime state

The octant approach [22], shown in Fig. 1, provides a means for characterizing the runtime state and corresponding partitioning requirements of a SAMR application. According to the octant approach, application state is classified with respect to (a) the adaptation pattern (scattered or localized), (b) whether runtime is dominated by computation or communication, and (c) the activity dynamics in the solution, e.g., whether the adaptation pattern changes rapidly. Applications may start off in one octant and then migrate to others as the solution progresses.

3.2. Partitioning quality metric

The dynamic partitioning requirements for a SAMR application and the performance of a partitioner can be abstracted using the five-component quality metric [22] outlined below.

1. Load imbalance: This component measures the load imbalance created by a partitioner. Imbalance occurs when any processor has more than the average load and can cause performance to deteriorate rapidly. This metric component assumes greater significance in a computation-dominated environment.

2. Communication requirement: This component is a measure of the inter-processor communication required by the SAMR algorithm for a given partitioning of the SAMR grid hierarchy, and is determined by the geometrical intersections between SAMR grid components. This metric component assumes greater significance in a communication-dominated environment.

3. Data migration: This component is a measure of the amount of data that has to be moved between processors when the application transitions from one partitioning to the next. It represents the ability of the partitioner to consider the existing distribution. This metric component assumes greater significance for highly dynamic applications and scattered adaptations.

4. Partitioning induced overhead: This component attempts to capture the quality of a partition using the number of sub-grids and their aspect ratio. A large number of small grids may produce better load balance, but the increased communication can cause performance to deteriorate. Similarly, a bad aspect ratio can adversely affect the computation-to-communication ratio for a grid component.

5. Partitioning time: This component represents how fast the partitioning algorithm computes the new partitioning, given an existing partitioning. It indicates the parallel efficiency of the partitioning algorithms employed.

Optimizing all the above components implies conflicting objectives. For example, optimizing components (1) and (2) together constitutes an NP-hard problem. Partitioners typically optimize a subset of the components at the expense of others. The quality metric enables the overall characterization of a partitioner and the definition of the dynamic partitioning requirements of the SAMR application.

3.3. Characterizing partitioner behavior

ARMaDA addresses application-sensitive partitioning for SAMR applications, especially those based on variants of space-filling curve [18] based techniques such as the ones from the GrACE distributed AMR infrastructure [14,15] and the Vampire SAMR partitioning library [21].


Table 2
Recommendations for mapping octants onto partitioning schemes

Octant   Scheme
I        SFC
II       pBD-ISP, SFC
III      G-MISP+SP, SP-ISP, SFC
IV       G-MISP+SP, SFC
V        pBD-ISP
VI       pBD-ISP
VII      G-MISP+SP, pBD-ISP
VIII     pBD-ISP

The three primary partitioners considered in the ARMaDA research are the inverse space-filling curve-based "composite" partitioner (SFC) from GrACE, the p-way binary dissection inverse space-filling curve partitioner (pBD-ISP), and the geometric multilevel inverse space-filling curve partitioner with sequence partitioning (G-MISP+SP) from Vampire. A characterization [6] of these partitioners using the above metric is summarized as follows.

SFC: The SFC scheme is fast, has average load balance, and generates low overhead, low communication, and average data migration costs. This scheme is suited for application states associated with low-to-moderate activity dynamics and more computation than communication.

G-MISP+SP: The G-MISP+SP scheme is aimed towards speed and simplicity, and favors simple communication patterns and partitioning speed over the amount of data migration. Sequence partitioning significantly improves the load balance but can be computationally expensive.

pBD-ISP: The pBD-ISP scheme is fast, and the coarse granularity of the partitioning it generates results in low overheads and communication costs. However, the overall load balance is average and may deteriorate when refinements are strongly localized. This scheme is suited for application states with greater communication and data requirements and lesser emphasis on load balance.

3.4. Mapping partitioners to application state octants

Partitioners can be mapped to application state octants based on the metric components they optimize and their ability to satisfy the partitioning requirements of the octant. Such a mapping, using experiments and heuristics [6,22], is summarized in Table 2.

To illustrate the use of this mapping in partitioner selection, consider an application that tracks an explosion, executing on a computer system with very low communication bandwidth. If the application has high activity dynamics, it will remain in the "far part" of the cube (octants V–VIII). The explosion will give rise to many scattered clusters of adaptation, hence putting it in the "right part" of the cube (octants VI and VIII). Due to the low communication bandwidth, the runtime will probably be dominated by communication. Therefore, it will end up in the "upper part" of the cube (octant VI). Based on the available partitioners and their characterization summarized above, pBD-ISP meets the requirements of each of these octants and is a suitable choice for these application states.
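A hypothetical encoding of these recommendations (Table 2) as a lookup, with the first entry of each list treated as the preferred scheme, might look as follows:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical lookup encoding the octant-to-partitioner recommendations of Table 2.
const std::map<int, std::vector<std::string>> kOctantPartitioners = {
    {1, {"SFC"}},
    {2, {"pBD-ISP", "SFC"}},
    {3, {"G-MISP+SP", "SP-ISP", "SFC"}},
    {4, {"G-MISP+SP", "SFC"}},
    {5, {"pBD-ISP"}},
    {6, {"pBD-ISP"}},
    {7, {"G-MISP+SP", "pBD-ISP"}},
    {8, {"pBD-ISP"}},
};

// Returns the first recommended scheme for the given octant (1 through 8).
std::string preferredPartitioner(int octant) {
    return kOctantPartitioners.at(octant).front();
}
```

For the explosion example above, the application ends up in octant VI, for which this lookup returns pBD-ISP, consistent with Table 2.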

4. Adaptive application-sensitive partitioning: a feasibility study

As mentioned previously, the partitioning requirements of a SAMR application depend on the application's configuration and its runtime state. The objective of this section is to demonstrate the benefits of adaptive application-sensitive partitioning based on a manual analysis of a SAMR application. We perform a feasibility study to illustrate the operation of such an adaptive SAMR partitioner. The adaptation policies are manually formulated using a SAMR simulator [20], and these policies are used to adaptively select and invoke the most suitable partitioner from among the available SFC, G-MISP+SP, and pBD-ISP domain-based partitioners. The advantages and benefits of adaptive partitioning are experimentally evaluated. The SAMR application used in this evaluation is a basic version of the RM3D compressible turbulence kernel (RM3Db) solving the Richtmyer–Meshkov instability in 3-D. The Richtmyer–Meshkov instability is a fingering instability which occurs at a material interface accelerated by a shock wave. This instability plays an important role in studies of supernovae and inertial confinement fusion.

4.1. Characterizing application state

Application characterization consists of two steps. First, the adaptive behavior of the application is captured in an adaptation trace generated using a single processor run. The adaptation trace contains snap-shots of the SAMR grid hierarchy at each regrid step. This trace is then analyzed using the octant approach and the adaptive partitioning strategy is defined. Note that while application characterization in this example is performed manually, this process is automated in ARMaDA as described in Section 5.

The application is executed for 800 coarse level time-steps and the trace consists of over 200 snap-shots. A selection of these snap-shots is shown in Fig. 2 to illustrate the dynamics of the RM3Db application. RM3Db starts out with a scattered adaptation, lower activity dynamics, and more computation, placing it initially in octant IV. Either G-MISP+SP or SFC is the most appropriate partitioner for this octant. As the application evolves, its adaptation patterns cause it to move between octants and may require different partitioning schemes. At regrid step 106, the RM3Db application has high communication requirements, high dynamics, and a scattered adaptation, placing it in octant VI. pBD-ISP is the partitioner of choice in this octant. At the end of the simulation, the application adaptations are localized, with increased computation and lower activity, placing the application in octant III, and G-MISP+SP is the appropriate partitioner.

Fig. 2. Sequence of profile views at sampled regrid steps for the RM3Db application.

4.2. Formulating partitioner adaptation policy using SAMR simulator

Once the RM3Db application trace containing snap-shots of the SAMR grid hierarchy is obtained, a SAMR simulator [20] is used to analyze the behavior of a partitioner for the application on 64 processors. Note that the adaptation policy presented here is formulated manually, based on an analysis of the simulator data. In the discussion below, we only present the simulator analysis for the SFC and pBD-ISP partitioners. As the G-MISP+SP scheme has similar characteristics to the SFC partitioner, its analysis is not presented. The SAMR simulator uses the trace to determine the performance of the partitioners at each regrid step in terms of the primary components of the quality metric (communication, data movement, load imbalance) presented in Section 3.2. Note that the other two metric components, partitioning overhead and partitioning time, are intrinsic characteristics of a partitioner. The partitioning time is directly dependent on the three primary components that determine the runtime performance of the partitioner. Figs. 3–5 plot the maximum values of the total communication, data movement, and load imbalance components of the quality metric, respectively, for the SFC and pBD-ISP partitioners on 64 processors. Analyzing these graphs, five operational zones can be identified.

1. Zone I (0 to ≈25): This is the application start-up zone and is typically associated with low communication and data migration. In this zone, we observe that the communication and data movement costs for both partitioners are similar. The deciding factor then becomes the load imbalance. Though SFC starts with high load imbalance, this is quickly corrected to achieve improved load balance. In this zone, the G-MISP+SP scheme starts with and maintains low imbalance. In contrast, the pBD-ISP scheme produces relatively high load imbalance in this zone. Hence, G-MISP+SP or SFC is the preferred partitioner for this zone.

2. Zone II (≈25 to ≈135): In this zone, there are substantial communication and data movement costs as the application evolves. The pBD-ISP scheme aims at reducing these costs while maintaining moderate-to-low load imbalance. Though the SFC and G-MISP+SP partitioners yield better load balance than pBD-ISP, they produce larger data movement and communication overheads that adversely affect performance. Resolving these conflicting objectives, the pBD-ISP scheme is expected to perform better in this zone.

3. Zone III (≈135 to ≈150): This zone is characterized by high activity dynamics and greater communication requirements. The SFC and G-MISP+SP schemes outperform pBD-ISP in both the load balance and communication metric components, while producing comparable data movement. Clearly, G-MISP+SP or SFC is the appropriate partitioner for this zone.

4. Zone IV (≈150 to ≈175): In this zone, pBD-ISP gives lower data movement and communication overheads and produces an acceptable load imbalance as compared to the G-MISP+SP or SFC scheme, making it the partitioner of choice for the zone.

5. Zone V (≈175 to ≈200): This is the relatively "quiet" final stage of the RM3Db application. The data migration and communication requirements are low and constant. Although all schemes perform equally well with respect to the quality metric, the G-MISP+SP or SFC scheme is slightly preferred as compared to pBD-ISP due to the better load balance produced.


Fig. 3. Simulator analysis: maximum total communication for RM3Db on 64 processors.

Fig. 4. Simulator analysis: maximum data movement for RM3Db on 64 processors.

Fig. 5. Simulator analysis: maximum load imbalance for RM3Db on 64 processors.

An application-sensitive partitioner adaptation policy (denoted by "adaptive" in this example) for RM3Db can be formulated using the simulator analysis for application-sensitive partitioning as shown in Table 3. The regrid step is used as the switching point where a new partitioner is selected, configured, and invoked.
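As a simple illustration (hypothetical code, using the zone boundaries listed in Table 3), such a manually formulated policy reduces to a lookup on the current regrid step at each switching point:

```cpp
#include <string>

// Hypothetical encoding of the manually formulated RM3Db policy of Table 3:
// the current regrid step selects the partitioner to configure and invoke.
std::string partitionerForRegridStep(int regridStep) {
    if (regridStep <= 22)  return "G-MISP+SP";  // Zone I
    if (regridStep <= 134) return "pBD-ISP";    // Zone II
    if (regridStep <= 148) return "G-MISP+SP";  // Zone III
    if (regridStep <= 174) return "pBD-ISP";    // Zone IV
    return "G-MISP+SP";                         // Zone V (through step 201)
}
```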

4.3. Experimental evaluation of the adaptive partitioner

The adaptive partitioning policy, manually formulated above, is experimentally evaluated on the NPACI IBM SP2 "Blue Horizon" at the San Diego Supercomputer Center, using the RM3Db application with a base grid of size 128*32*32, 3 levels of factor 2 space–time refinements, and regridding performed every 4 time-steps. The experimental evaluation compares the overall execution times for application runs using individual partitioners as well as the adaptive partitioning strategy outlined above. The partitioner as well as the partitioning parameters are switched on-the-fly during application execution. The results of the experiments for 64 processors are listed in Table 4.


Table 3
Policy formulation for adaptive partitioning for the RM3Db application

Zone   Starting regrid step   Ending regrid step   Partitioning strategy
I      0                      22                   G-MISP+SP
II     23                     134                  pBD-ISP
III    135                    148                  G-MISP+SP
IV     149                    174                  pBD-ISP
V      175                    201                  G-MISP+SP

Table 4
Adaptive partitioning evaluation for the RM3Db application on 64 processors

Partitioner   Execution time (s)   Max. load imbalance (%)   AMR efficiency (%)
SFC           484.502              24.88                     98.82
G-MISP+SP     405.062              11.32                     98.78
pBD-ISP       414.952              35.03                     98.86
"Adaptive"    352.824              8.12                      98.76

Note that the "adaptive" evaluation in this example uses a combination of the pBD-ISP and G-MISP+SP partitioners, rather than the SFC partitioner presented in the simulator analysis. This results in improved load balance during an actual application run with all processors utilized. The basic RM3Db version used in this evaluation serves as a "feasibility study" example and is significantly different from the RM3D kernel evaluated using the ARMaDA framework in Section 6.3, resulting in different execution times.

The above results demonstrate that an adaptive application-sensitive partitioning policy that dynamically switches partitioners (and associated partitioning parameters) at runtime improves application performance and results in reduced execution times. For 64 processors, the improvement is 27.2% over the slowest partitioner. In the adaptive partitioner case, G-MISP+SP improves the load balance when the application is computationally dominated, while pBD-ISP reduces communication and data-migration overheads. This "feasibility study" example demonstrates the potential benefits of adaptive application-sensitive partitioning and motivates the development of the ARMaDA autonomic partitioning framework.

5. ARMaDA autonomic partitioning framework

ARMaDA is an autonomic partitioning framework for SAMR applications that dynamically selects and configures partitioning algorithms at runtime to optimize overall application performance. The ARMaDA framework has three components: an application state monitoring and characterization component, a partitioner selection component and policy engine, and a runtime adaptation component. The state characterization component implements mechanisms that abstract and characterize the current application state in terms of the computation/communication requirements, application dynamics, and the nature of the adaptation. The policy engine defines the associations between application state and partitioner behavior. Using the characterized application state and the defined policies, the appropriate partitioning algorithm is selected at runtime from the available partitioners in the repository. The partitioners used include a selection from software tools such as GrACE [14,15] and Vampire [21]. The runtime adaptation component dynamically configures the selected partitioner with appropriate partitioning parameters and invokes it at each regrid step. Note that ARMaDA does not require any a priori knowledge about the behavior of the application or its adaptation patterns.

5.1. Automated application state monitoring and characterization

The current state of a SAMR application and its partitioning requirements are determined using the structure of the SAMR grid hierarchy, which is expressed as sets of bounding boxes assigned to the different processors at various levels of refinement. These boxes are used to inexpensively compute the computation/communication requirements, application dynamics, and the nature of the adaptation using simple geometric operations as follows.

5.1.1. Computation/communication requirements
The computation-to-communication ratio (CCratio) determines whether the current application state is computationally intensive or communication-dominated. Since it is quite challenging to determine the exact computation and communication requirements of the SAMR grid hierarchy at runtime, a simple approximation based on the geometry of the structured application domain is used to deduce these values. The volume of the bounding boxes at different levels of refinement gives an estimate of the computational work while the surface area gives an estimate of the communication associated with the grid hierarchy. Consequently, the ratio of the total volume to the total surface area of the bounding boxes determines CCratio for the current state of the application.

CCratio = Σ(Volume of bounding boxes) / Σ(Surface area of bounding boxes).
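A minimal sketch of this geometric approximation, assuming a simple axis-aligned bounding-box type (hypothetical types, not ARMaDA's actual data structures):

```cpp
#include <vector>

// Hypothetical axis-aligned bounding box for a SAMR grid component.
struct Box {
    double xlo, ylo, zlo, xhi, yhi, zhi;
    double volume() const { return (xhi - xlo) * (yhi - ylo) * (zhi - zlo); }
    double surfaceArea() const {
        const double dx = xhi - xlo, dy = yhi - ylo, dz = zhi - zlo;
        return 2.0 * (dx * dy + dy * dz + dz * dx);
    }
};

// CCratio: total volume (estimate of computation) over total surface area
// (estimate of communication) of the bounding boxes in the grid hierarchy.
double computeCCratio(const std::vector<Box>& hierarchyBoxes) {
    double totalVolume = 0.0, totalSurface = 0.0;
    for (const Box& b : hierarchyBoxes) {
        totalVolume += b.volume();
        totalSurface += b.surfaceArea();
    }
    return totalSurface > 0.0 ? totalVolume / totalSurface : 0.0;
}
```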

5.1.2. Application dynamics
The application dynamics (Dynamics) estimate the rate of change of the application refinement patterns and are determined by comparing the current application state with its previous state. Information about the previous application state is stored within the application state monitoring and characterization component. To compute the Dynamics metric, the overlap between the current set of bounding boxes and the previous set of boxes at each level is computed. A greater overlap region indicates that the application has not changed significantly and has low activity dynamics. A smaller overlap region represents high activity dynamics.

Dynamics = Size of (Current state boxes ∩ Previous state boxes).

5.1.3. Nature of adaptation
The nature of the adaptation (Adapt) captures the adaptation pattern, i.e., whether the refinements are scattered or localized. Scattered refinements can give rise to greater overheads. Determining exact refinement patterns and their location in the SAMR grid hierarchy can once again be quite complex. A simpler, approximate algorithm determines the nature of adaptation based on the number and geometric arrangement of refinement regions within the overall domain under consideration. A higher number of refinement patches is an indicator of possibly scattered refinements. Since differences in adaptation overheads do not significantly affect partitioner performance, this simple scheme provides a reasonable assessment of the nature of adaptation.

Adapt = (Volume of refinement regions / Domain volume) ∗ Number of refinement patches.
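The following sketch, under the same hypothetical bounding-box assumption as the CCratio example above, computes both the Dynamics overlap measure of Section 5.1.2 and the Adapt measure defined here:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical axis-aligned bounding box (same assumption as the CCratio sketch).
struct Box {
    double xlo, ylo, zlo, xhi, yhi, zhi;
    double volume() const { return (xhi - xlo) * (yhi - ylo) * (zhi - zlo); }
};

// Volume of the geometric intersection of two boxes (0 if they do not overlap).
double overlapVolume(const Box& a, const Box& b) {
    const double dx = std::max(0.0, std::min(a.xhi, b.xhi) - std::max(a.xlo, b.xlo));
    const double dy = std::max(0.0, std::min(a.yhi, b.yhi) - std::max(a.ylo, b.ylo));
    const double dz = std::max(0.0, std::min(a.zhi, b.zhi) - std::max(a.zlo, b.zlo));
    return dx * dy * dz;
}

// Dynamics: size of the overlap between the current and previous sets of boxes at a
// refinement level; a large overlap means the pattern changed little (low dynamics).
double computeDynamics(const std::vector<Box>& current, const std::vector<Box>& previous) {
    double overlap = 0.0;
    for (const Box& c : current)
        for (const Box& p : previous)
            overlap += overlapVolume(c, p);
    return overlap;
}

// Adapt: (volume of refinement regions / domain volume) * number of refinement patches.
double computeAdapt(const std::vector<Box>& refinements, double domainVolume) {
    double refinedVolume = 0.0;
    for (const Box& r : refinements) refinedVolume += r.volume();
    return (refinedVolume / domainVolume) * static_cast<double>(refinements.size());
}
```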

5.1.4. Runtime state characterization
The three metrics described above can now be used to effectively estimate the state of the SAMR application at runtime. A key concern during partitioner selection and adaptation in ARMaDA is the possibility of thrashing due to very frequent application state changes which, in turn, can result in frequent switching of partitioners and increased overheads. To avoid thrashing, the ARMaDA framework maintains a history of the application state by storing the structure of the grid hierarchy for two preceding regrid steps. This 3-step sliding window (current and two preceding) is used heuristically to detect transient changes and avoid oscillations in application state by analyzing the magnitude of the metric change (going low or high). Normalized application metric ratios are computed as follows.

Let M_r denote any one of the three metrics computed from the state of the grid hierarchy at regrid step r. The normalized metric ratio is then computed as

Mratio = (M_r × M_{r−2}) / (M_{r−1})².

Three normalized ratios are computed corresponding to the three metrics, viz., the computation/communication ratio "Cratio", the application dynamics ratio "Dratio", and the adaptation ratio "Aratio".
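As a small illustration (hypothetical helper; ARMaDA's internal interface is not shown), the normalized ratio over the 3-step sliding window is:

```cpp
// Normalized metric ratio over the 3-step sliding window:
// Mratio = M_r * M_{r-2} / (M_{r-1})^2, where M_r is the metric value at regrid step r.
// The same helper would be applied to the CCratio, Dynamics, and Adapt histories to
// obtain Cratio, Dratio, and Aratio.
double normalizedMetricRatio(double currM, double prevM, double prevPrevM) {
    return (currM * prevPrevM) / (prevM * prevM);
}
```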

5.2. Policy-based partitioner selection

The three normalized ratios (Cratio, Dratio, and Aratio) computed by the state characterization component are used, along with application state thresholds and application metric weights, to assign a bit to each metric and adjust metric values to closely match the needs of the dynamic application. The three bits are then combined to map the application to a specific application state octant. The appropriate partitioner is then selected from the set of partitioners that are mapped to the octant and satisfy the partitioning requirements associated with that octant.

Application state thresholds are user-defined parameters and represent the lower and upper limiting values for the normalized metric ratios for a particular application state. A small threshold value can lead to thrashing, resulting in frequent state changes and switching of partitioners, causing increased overheads and unstable adaptations. A large threshold value can result in fewer state changes being detected and insufficient adaptations, causing degraded performance. In our evaluation, application state thresholds ranging between 20% and 50% provide reasonable overall performance. The application metric weights are heuristic parameters that can be set by the user to fine-tune metric sensitivity when prior knowledge about the application characteristics is available. For example, in a computationally intensive SAMR application, the weight for Cratio (computation/communication ratio) can be set higher than the weights for other metric ratios, making the partitioner selection strategy more sensitive to changes in the computational requirements of the application. Default metric weights of 1.0 can be used when this information is not known a priori.

Let Mratio and Mbit represent the normalized ratio and the octant-bit value for a particular metric M. Let lowMwt and highMwt denote the application metric weights for the low and high application state thresholds (represented by LOW_THRESH and HIGH_THRESH), respectively. The metric ratio to octant bit mapping is given by

Mbit = 0, if Mratio < (lowMwt ∗ LOW_THRESH),
Mbit = 1, if Mratio >= (highMwt ∗ HIGH_THRESH),
Mbit unchanged, otherwise.

Three octant bits, "Cbit", "Dbit", and "Abit", corresponding to the three metric ratios are computed in this way. A low Cbit (Cbit = 0) represents more communication while a high Cbit (Cbit = 1) indicates greater computation. A low Dbit (Dbit = 0) indicates greater activity whereas a high Dbit (Dbit = 1) represents lesser application dynamics.


Table 5
Association of application state to octants and partitioners within the ARMaDA framework

ARMaDA state (C D A)   Octant position   Suitable partitioner
0 0 0                  V                 pBD-ISP
0 0 1                  VI                pBD-ISP
0 1 0                  I                 SFC
0 1 1                  II                pBD-ISP, SFC
1 0 0                  VII               G-MISP+SP, pBD-ISP
1 0 1                  VIII              pBD-ISP
1 1 0                  III               G-MISP+SP, SP-ISP, SFC
1 1 1                  IV                G-MISP+SP, SFC

A low Abit (Abit = 0) denotes more localized adaptation while a high Abit (Abit = 1) indicates that the nature of adaptation is scattered. These bits are then used to identify the ARMaDA characterized state, which is then mapped to an application state octant and associated partitioners using the policies shown in Table 5.
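A minimal sketch of this mapping (hypothetical code, not ARMaDA's actual interface; the default thresholds follow the RM3D settings reported in Section 6.3):

```cpp
// Hysteresis-based mapping of a normalized metric ratio to an octant bit, and of the
// three bits (Cbit, Dbit, Abit) to an octant position, following Table 5.
struct Thresholds {
    double lowThresh  = 0.6;  // LOW_THRESH
    double highThresh = 1.4;  // HIGH_THRESH
    double lowWt  = 1.0;      // lowMwt (metric weight for the low threshold)
    double highWt = 1.0;      // highMwt (metric weight for the high threshold)
};

int updateMetricBit(double mRatio, int previousBit, const Thresholds& t) {
    if (mRatio < t.lowWt * t.lowThresh) return 0;
    if (mRatio >= t.highWt * t.highThresh) return 1;
    return previousBit;  // between the thresholds, the bit remains unchanged
}

// Octant position indexed by (Cbit, Dbit, Abit):
// 000->V, 001->VI, 010->I, 011->II, 100->VII, 101->VIII, 110->III, 111->IV.
int octantFromBits(int cBit, int dBit, int aBit) {
    static const int octant[2][2][2] = {{{5, 6}, {1, 2}}, {{7, 8}, {3, 4}}};
    return octant[cBit][dBit][aBit];
}
```

The resulting octant can then be used to look up a suitable partitioner, as in the Table 2 sketch of Section 3.4.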

5.3. Adaptive application-sensitive partitioning

The ARMaDA adaptation component configures the selected partitioner with appropriate partitioning parameters (such as partitioning granularity) and invokes it to partition and load balance the SAMR grid hierarchy. A common interface is defined for the available partitioners, allowing them to be uniformly invoked. The partitioner parameters are selected based on the characterization of the application state and the determination of the octant in which the application currently resides. The granularity is based on the requirements of the octant, though it may be overridden by a user-defined value.

The overhead associated with ARMaDA is the effort required to (a) determine and characterize the application state by monitoring the SAMR grid hierarchy at runtime, (b) select and configure the appropriate partitioner and partitioning parameters based on the defined policy and thresholds, and (c) invoke the new partitioner to partition the SAMR grid hierarchy. Note that these steps are performed at runtime when the application regrids between two phases of computation.

A key concern in adaptive partitioning is ensuring that the overheads of characterizing the application state and switching partitioners (if necessary) at runtime do not offset the benefits of adaptation. ARMaDA uses efficient and inexpensive mechanisms based on the union and intersection of the structured geometry of the grid components to characterize application state and compute the required metric ratios. Furthermore, ARMaDA provides optimizations and control parameters that can be used to customize the sensitivity, quality, and overheads of application monitoring and partitioner adaptation. For example, if the application state remains unchanged for a specific period (defined by the parameter SAME_REGRID_LIMIT), ARMaDA can be directed to stop sensing application state for a certain number of regrid steps (defined by the parameter SKIP_SAME_REGRID), thereby reducing monitoring overheads. Similarly, when a change in runtime state causes a dynamic switching of partitioners, it is assumed that application characteristics do not change significantly in the immediately succeeding regrids. In this case, runtime state sensing is paused for SKIP_SWITCH steps. These three optimization parameters (SAME_REGRID_LIMIT, SKIP_SAME_REGRID, and SKIP_SWITCH) are user-defined and attempt to control adaptation overheads. Very small or large values of these parameters can, respectively, lead to excessive or insufficient adaptation, resulting in degraded application performance. In our evaluation, 2 or 4 regrid steps are reasonable parameter values for optimization.
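A minimal sketch of these controls (hypothetical bookkeeping, not ARMaDA's actual implementation; the default values of 4 follow the RM3D settings in Section 6.3):

```cpp
// Hypothetical bookkeeping for the monitoring-overhead controls described above.
struct SensingControl {
    int sameRegridLimit = 4;  // SAME_REGRID_LIMIT
    int skipSameRegrid  = 4;  // SKIP_SAME_REGRID
    int skipSwitch      = 4;  // SKIP_SWITCH

    int unchangedCount = 0;   // consecutive regrid steps with unchanged state
    int skipRemaining  = 0;   // regrid steps left with sensing paused

    // Returns true if the application state should be sensed at this regrid step.
    bool shouldSense() {
        if (skipRemaining > 0) { --skipRemaining; return false; }
        return true;
    }

    // Called after sensing: pause sensing after a partitioner switch, or after the
    // state has remained unchanged for sameRegridLimit consecutive regrid steps.
    void afterSensing(bool stateChanged, bool partitionerSwitched) {
        if (partitionerSwitched) {
            skipRemaining = skipSwitch;
            unchangedCount = 0;
        } else if (!stateChanged) {
            if (++unchangedCount >= sameRegridLimit) {
                skipRemaining = skipSameRegrid;
                unchangedCount = 0;
            }
        } else {
            unchangedCount = 0;
        }
    }
};
```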

6. Evaluation of ARMaDA framework

The experimental evaluation of the ARMaDA framework is performed using various SAMR application kernels. The experiments consist of measuring overall application execution times for different partitioner configurations using the ARMaDA framework. The partitioners used in the evaluation include SFC, G-MISP+SP, pBD-ISP, and adaptive combinations of these partitioners determined at runtime based on application state. With the exception of the partitioning strategy and associated partitioning granularity, all application-specific and refinement-specific parameters are kept constant.

In the adaptive partitioning experiments using ARMaDA, an initial partitioner is selected by the user along with other partitioning parameters and thresholds. The initial partitioner is used for the initial distribution of the SAMR grid hierarchy, while the partitioners used in subsequent distributions are autonomically determined by ARMaDA. The experimental results demonstrate that autonomic application-sensitive partitioning using ARMaDA can improve application performance as compared to non-adaptive partitioning. The results also show that the choice of the initial partitioner does not significantly affect the performance of the application as ARMaDA dynamically adapts to select the most appropriate partitioner based on the application state and its partitioning requirements.

Note that the feasibility study presented in Section 4 uses an application trace and its characterization to manually define a runtime adaptation policy that yields performance benefits. Since ARMaDA automates the characterization and adaptation processes, a priori knowledge of the runtime characteristics of the SAMR application is not required to make runtime decisions—ARMaDA autonomically adapts and optimizes the partitioner.


Table 6
ARMaDA evaluation for the TportAMR-2D application on 8 processors on "Discover"

Partitioner                                   Execution time (s)
SFC alone (no ARMaDA)                         352.091
ARMaDA with SFC start                         349.286
G-MISP+SP alone (no ARMaDA)                   360.946
ARMaDA with G-MISP+SP start and thrashing     353.44
pBD-ISP alone (no ARMaDA)                     344.558
ARMaDA with pBD-ISP start                     342.646


6.1. TportAMR-2D on “Discover”

"Discover" is a 16-node Linux Beowulf cluster comprising 8 dual-processor Pentium IIIs at The Applied Software Systems Laboratory at Rutgers University. The Transport AMR 2D (TportAMR-2D) application is a benchmark kernel solving the transport equation and is primarily communication-dominated with high adaptation overheads. This evaluation is performed on 8 processors of Discover using an application base grid of size 64*64 and 3 levels of factor 2 space–time refinements. Regridding is performed every 4 time-steps at each level and the application executes for 40 coarse level time steps. The experiments compare application runs using each individual partitioner in "stand alone" mode without ARMaDA and as initial partitioners with ARMaDA. The application execution times are listed in Table 6.

The TportAMR-2D application used in this experiment is computationally inexpensive but requires considerable communication and data movement. Furthermore, the experiment uses a small domain and few iterations. Since the pBD-ISP partitioner is particularly suited to strongly communication-dominated application states and reduces communication and data migration costs [22], it is expected to perform better than the other partitioners. The results in Table 6 show that the ARMaDA framework with pBD-ISP as the initial partitioner performs the best and its performance is similar to the stand alone pBD-ISP case. When a different initial partitioner is used, the ARMaDA framework dynamically switches to pBD-ISP and results in a slightly better performance than the stand alone case. In this evaluation, the overheads associated with the ARMaDA framework range between 40 and 100 ms for an entire run of approximately 350 s. Though thrashing is detected in the case of G-MISP+SP with switching between the G-MISP+SP and pBD-ISP partitioners, its effects are not significant due to the ARMaDA optimizations described earlier. Due to the application characteristics, adaptive partitioning does not produce significantly different execution times in this evaluation.

Table 7
ARMaDA evaluation for the VectorWave-2D application on 32 processors on "Frea"

Partitioner              Execution time (s)
SFC                      637.478
G-MISP+SP                611.749
pBD-ISP                  592.05
ARMaDA with SFC start    470.531

6.2. VectorWave-2D on “Frea”

"Frea" is a 64-node Linux Beowulf cluster, also at The Applied Software Systems Laboratory at Rutgers University. The VectorWave-2D application is a coupled set of partial differential equations and forms a part of the Cactus 2-D numerical relativity toolkit, solving Einstein's and gravitational equations. This evaluation is performed on 32 processors of Frea using an application base grid of size 128*128 and 3 levels of factor 2 space–time refinements. Regridding is performed every 4 time-steps at each level and the application runs for 60 coarse level time steps. The application execution times for the individual partitioners and for ARMaDA with SFC start are presented in Table 7.

The VectorWave-2D application is primarily computation-dominated, requiring good load balance and reduced communication and data migration costs. The SFC and pBD-ISP partitioners optimize the communication and data migration metric components while G-MISP+SP gives good load balance. In this evaluation, the ARMaDA partitioner with SFC start improves performance by 26.19% over the slowest partitioner. The ARMaDA framework overhead for state sensing and adaptation in this case is 0.0616% of the total execution time, which is negligible.

6.3. RM2D & RM3D on “Blue Horizon”

The NPACI IBM SP2 "Blue Horizon" is a Teraflop-scale Power3-based clustered SMP system at the San Diego Supercomputer Center. RM2D and RM3D are 2-D and 3-D versions, respectively, of a compressible turbulence kernel solving the Richtmyer–Meshkov instability. This evaluation is performed on 64 processors of Blue Horizon. The RM2D evaluation uses a base grid of size 128*32 and 60 coarse level time steps, while RM3D uses a base grid of size 128*32*32 and 100 coarse level time steps. Both evaluations use 3 levels of factor 2 space–time refinements and regridding is performed every 4 time-steps at each level. The application execution times for the individual partitioners and ARMaDA for RM2D and RM3D are listed in Tables 8 and 9, respectively.

Both RM2D and RM3D start out as computation-dominated with low activity dynamics and a more scattered adaptation, placing them in octant IV. As the applications evolve, their communication requirements become more significant.


Table 8
RM2D execution times for ARMaDA partitioners on 64 processors on "Blue Horizon"

Partitioner                      Execution time (s)
SFC                              264.041
G-MISP+SP                        214.745
pBD-ISP                          199.738
ARMaDA with G-MISP+SP start      190.431

Table 9
RM3D execution times for ARMaDA partitioners on 64 processors on "Blue Horizon"

Partitioner              Execution time (s)
SFC                      5065.51
pBD-ISP                  4016.91
ARMaDA with SFC start    3126.3

Table 10
ARMaDA adaptive policy for RM3D application on 64 processors on “Blue Horizon”

Iteration      RM3D State       Octant      Appropriate
number         C    D    A      position    partitioner
1–6            1    1    1      IV          SFC
7–16           0    1    1      II          pBD-ISP
17–26          1    1    0      III         SFC
27–36          0    1    1      II          pBD-ISP
37–46          1    1    0      III         SFC
47–56          0    1    1      II          pBD-ISP
57–66          1    0    0      VII         SFC
67–84          0    0    1      VI          pBD-ISP
84–100         1    1    0      III         SFC

Moreover, the applications exhibit different dynamics and adaptations at different stages of their execution. The pBD-ISP scheme improves communication and data migration metric components and is seen to perform better than the SFC and G-MISP+SP partitioners. The ARMaDA adaptive partitioner starts out with either the G-MISP+SP or SFC partitioner as these partitioners are better suited for the initial stages of the application. As an illustrative example, the ARMaDA settings used for the RM3D evaluation are SAME_REGRID_LIMIT = 4, SKIP_SAME_REGRID = 4, SKIP_SWITCH = 4, LOW_THRESH = 0.6, HIGH_THRESH = 1.4, and metric weights (Cwt, Dwt, Awt) = 1.0. The adaptive partitioner selection used by ARMaDA in the case of RM3D is listed in Table 10.
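
The state-to-partitioner mapping exercised in Table 10 can be summarized in code. The sketch below reproduces only the (C, D, A) combinations that actually occur in the table and uses hypothetical names (PolicyEntry, selectPartitioner); it is not the complete ARMaDA octant policy.

// The five distinct (C, D, A) application states exercised by RM3D in Table 10,
// together with the octant and partitioner ARMaDA associates with each.
// Names and structure are illustrative only, not the ARMaDA implementation.
struct PolicyEntry { int C, D, A; const char* octant; const char* partitioner; };

const PolicyEntry rm3dPolicy[] = {
    {1, 1, 1, "IV",  "SFC"},
    {0, 1, 1, "II",  "pBD-ISP"},
    {1, 1, 0, "III", "SFC"},
    {1, 0, 0, "VII", "SFC"},
    {0, 0, 1, "VI",  "pBD-ISP"},
};

// Hypothetical lookup over the observed states; states not listed in Table 10
// fall back (arbitrarily, for this sketch) to SFC.
const char* selectPartitioner(int C, int D, int A) {
    for (const PolicyEntry& e : rm3dPolicy)
        if (e.C == C && e.D == D && e.A == A) return e.partitioner;
    return "SFC";
}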

The overheads associated with the ARMaDA framework are low. For the RM2D and RM3D evaluations, the overheads are 0.415 and 1.85 s, respectively, and are negligible as compared to the overall application execution times. These experimental results demonstrate that autonomic partitioning using the ARMaDA framework reduces execution time and improves application performance. In the case of the RM2D application, the improvements due to ARMaDA are 4.66%, 11.32%, and 27.88% over the pBD-ISP, G-MISP+SP, and SFC partitioners, respectively. In the case of the RM3D application, these improvements are 22.17% over the pBD-ISP partitioner and 38.28% over the SFC scheme. Note that the RM3D kernel used in this evaluation is significantly different from the basic RM3D version evaluated using a manually orchestrated policy in Section 4.3, exhibiting different physics and, hence, different execution times.
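
These percentages are consistent with the execution times in Tables 8 and 9 when the improvement is read as the relative reduction in execution time (this formula is implied rather than stated here). For example, for RM2D relative to SFC:

improvement over SFC = (264.041 − 190.431) / 264.041 × 100% ≈ 27.88%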

7. Conclusion and future work

This paper presents the design, implementation, and evaluation of ARMaDA, an autonomic application-sensitive partitioning framework for SAMR applications. The overall goal of ARMaDA is to manage the dynamism and space-time heterogeneity, and support the efficient and scalable execution of SAMR implementations for dynamically adaptive scientific and engineering simulations. The ARMaDA framework provides mechanisms for automatically characterizing the dynamic application state at runtime, and abstracting the current computational, communication, and storage requirements. This runtime state is then used by ARMaDA to dynamically select, configure, and invoke the appropriate partitioning strategy from an available repository of partitioners. The performance of the ARMaDA framework is experimentally evaluated using various SAMR application kernels. The experimental results validate the advantages of adaptive application-sensitive partitioning and the ARMaDA autonomic partitioning framework.

The ARMaDA framework focuses on application-sensitive adaptations and assumes uniform system characteristics (no system adaptations are considered). Since a fully autonomic system must also be aware of the execution environment, we plan to investigate runtime machine characteristics as part of future work. The framework could be extended to include system-sensitive capabilities so as to use information about the CPU, memory, and bandwidth availabilities of the computing elements to perform better adaptations. Permutations of these policies along with system-sensitive rules could then be applied at runtime to accurately model the behavior and optimize the performance of SAMR applications.

Acknowledgments

The authors would like to thank Johan Steensland for access to Vampire in this research. The authors are grateful to Ravi Samtaney and Mijan Huq for making the applications available for use.

References

[1] ASCI/ASAP Center, http://www.cacr.caltech.edu/ASAP, California Institute of Technology.



[2] M. Berger, G. Hedstrom, J. Oliger, G. Rodrigue, Adaptive Mesh Refinement for 1-Dimensional Gas Dynamics, Scientific Computing (IMACS/North-Holland Publishing), 1983, pp. 43–47.

[3] M. Berger, J. Oliger, Adaptive Mesh Refinement for Hyperbolic Partial Differential Equations, J. Comput. Phys. 53 (1984) 484–512.

[4] G. Bryan, Fluids in the Universe: Adaptive Mesh Refinement in Cosmology, Comput. Sci. Eng. (March-April 1999) 46–53.

[5] S. Chandra, M. Parashar, ARMaDA: An Adaptive Application-Sensitive Partitioning Framework for SAMR Applications, in: Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, Cambridge, MA, November 2002.

[6] S. Chandra, M. Parashar, An Evaluation of Partitioners for Parallel SAMR Applications, in: R. Sakellariou, J. Keane, J. Gurd, L. Freeman (Eds.), Proceedings of the 7th International Euro-Par Conference, Lecture Notes in Computer Science, vol. 2150, Springer, Berlin, August 2001, pp. 171–174.

[7] S. Chandra, M. Parashar, S. Hariri, GridARM: An Autonomic Runtime Management Framework for SAMR Applications in Grid Environments, New Frontiers in High-Performance Computing, Proceedings of the Autonomic Applications Workshop, 10th International Conference on High Performance Computing (HiPC), Hyderabad, India, Elite Publishing, December 2003, pp. 286–295.

[8] M. Choptuik, Experiences with an Adaptive Mesh Refinement Algorithm in Numerical Relativity, in: C. Evans, L. Finn, D. Hobill (Eds.), Frontiers in Numerical Relativity, Cambridge University Press, London, 1989, pp. 206–221.

[9] P. Crandall, M. Quinn, A Partitioning Advisory System for Networked Data-parallel Processing, Concurrency: Practice and Experience 7 (5) (August 1995) 479–495.

[10] S. Hawley, M. Choptuik, Boson Stars Driven to the Brink of Black Hole Formation, Phys. Rev. D 62 (2000) 104024.

[11] Z. Lan, V. Taylor, G. Bryan, Dynamic Load Balancing for Structured Adaptive Mesh Refinement Applications, in: Proceedings of the 30th International Conference on Parallel Processing, Valencia, Spain, 2001.

[12] M. Norman, G. Bryan, Cosmological Adaptive Mesh Refinement, Numerical Astrophysics 1998, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999.

[13] L. Oliker, R. Biswas, PLUM: Parallel Load Balancing for Adaptive Unstructured Meshes, J. Parallel Distrib. Comput. 52 (2) (1998) 150–177.

[14] M. Parashar, http://www.caip.rutgers.edu/~parashar/TASSL/Projects/GrACE, GrACE homepage, 2001.

[15] M. Parashar, J. Browne, C. Edwards, K. Klimkowski, A Common Data Management Infrastructure for Adaptive Algorithms for PDE Solutions, in: Proceedings of the Supercomputing Conference, ACM/IEEE Computer Society, San Jose, CA, November 1997.

[16] M. Parashar, J. Wheeler, G. Pope, K. Wang, P. Wang, A New Generation EOS Compositional Reservoir Simulator: Part II—Framework and Multiprocessing, in: Proceedings of the Society of Petroleum Engineering Reservoir Simulation Symposium, Dallas, TX, June 1997.

[17] R. Pember, J. Bell, P. Colella, W. Crutchfield, M. Welcome, Adaptive Cartesian Grid Methods for Representing Geometry in Inviscid Compressible Flow, in: Proceedings of the 11th AIAA Computational Fluid Dynamics Conference, Orlando, FL, July 1993.

[18] H. Sagan, Space-filling Curves, Springer, Berlin, 1994.

[19] K. Schloegel, G. Karypis, V. Kumar, A Unified Algorithm for Load-balancing Adaptive Scientific Simulations, in: Proceedings of the International Conference on Supercomputing, Dallas, TX, November 2000.

[20] M. Shee, Evaluation and Optimization of Load Balancing/Distribution Techniques for Adaptive Grid Hierarchies, M.S. Thesis, Graduate School, Rutgers University, NJ, 2000.

[21] J. Steensland, http://www.caip.rutgers.edu/~johans/vampire, Vampire homepage, 2000.

[22] J. Steensland, S. Chandra, M. Parashar, An Application-Centric Characterization of Domain-Based SFC Partitioners for Parallel SAMR, IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 12, IEEE Computer Society Press, Silver Spring, MD, December 2002, pp. 1275–1289.

[23] P. Wang, I. Yotov, T. Arbogast, C. Dawson, M. Parashar, K. Sepehrnoori, A New Generation EOS Compositional Reservoir Simulator: Part I—Formulation and Discretization, in: Proceedings of the Society of Petroleum Engineering Reservoir Simulation Symposium, Dallas, TX, June 1997.

Sumir Chandra is a Ph.D. student in the Department of Electrical and Computer Engineering at Rutgers, The State University of New Jersey. He received a B.E. degree in Electronics Engineering from Bombay University, India, and an M.S. degree in Computer Engineering from Rutgers University. Sumir is a researcher at The Applied Software Systems Laboratory at the Center for Advanced Information Processing at Rutgers. His research interests include parallel and distributed computing, autonomic computing, scientific computing, software engineering, runtime management, and performance evaluation and prediction. Sumir received the Best Research Poster Award at the Supercomputing Conference at Denver, USA, in 2001. He has co-authored 12 technical papers in international journals and conferences, has co-authored/edited 3 book chapters/proceedings, and has reviewed several technical contributions in prestigious journals and conferences. Additional information is available at http://www.caip.rutgers.edu/~sumir/.

Manish Parashar is Associate Professor of Electrical and Computer Engineering at Rutgers University, where he is also director of the Applied Software Systems Laboratory. He received a BE degree in Electronics and Telecommunications from Bombay University, India, and MS and Ph.D. degrees in Computer Engineering from Syracuse University. He has received the NSF CAREER Award and the Enrico Fermi Scholarship from Argonne National Laboratory. His current research interests include parallel and distributed computing (including autonomic, Grid and peer-to-peer computing) with applications to computational science and engineering, networking and software engineering. Manish is a member of the executive committee of the IEEE Computer Society Technical Committee on Parallel Processing (TCPP), part of the IEEE Computer Society Distinguished Visitor Program (2004-2006), and a member of ACM. He is also the co-founder of the IEEE International Conference on Autonomic Computing (ICAC). Manish has co-authored over 140 technical papers in international journals and conferences, has co-authored/edited 6 books/proceedings, and has contributed to several others in the area of computational science and parallel and distributed computing. For more information please visit http://www.caip.rutgers.edu/~parashar/.