
Rescheduling

Sathish Vadhiyar

Rescheduling Motivation

Heterogeneity and contention can cause an application's performance to vary over time

Rescheduling decisions in response to changes in resource performance are necessary, triggered by:
- Performance degradation of the running applications
- Availability of "better" resources

Modeling the Cost of Redistribution

Cthreshold depends on:
- Model accuracy
- Load dynamics of the system

Modeling the Cost of Redistribution

Redistribution Cost Model for Jacobi 2D

• Emax – average iteration time of the processor that is farthest behind

• Cdev – processor performance deviation variable

Redistribution Cost Model for Jacobi 2D
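
The redistribution cost model itself appears as an equation in the original slides and is given in the Shao, Wolski and Berman paper cited at the end. Purely to illustrate how Emax, Cdev and Cthreshold might fit into a rescheduling decision, the sketch below assumes a rule of the form "redistribute when the per-iteration saving, accumulated over the remaining iterations, exceeds the predicted redistribution cost plus Cthreshold"; the function names and the way the quantities are combined are assumptions, not the published model.

/* Hypothetical sketch of a redistribution decision for Jacobi 2D.
 * The way Emax, Cdev, the redistribution cost and Cthreshold are
 * combined here is an assumption made for illustration only. */
#include <stdio.h>

/* Predicted per-iteration time after rebalancing, inflated by the
 * measured performance deviation Cdev of the processors. */
static double predicted_balanced_time(double e_avg, double c_dev)
{
    return e_avg * (1.0 + c_dev);
}

/* Return 1 if redistribution is predicted to pay off. */
int should_redistribute(double e_max,       /* avg. iteration time of the slowest processor */
                        double e_avg,       /* avg. iteration time across all processors    */
                        double c_dev,       /* processor performance deviation variable     */
                        long   iters_left,  /* remaining Jacobi iterations                  */
                        double redist_cost, /* predicted cost of moving the data (seconds)  */
                        double c_threshold) /* minimum gain worth acting on (e.g. 15 s)     */
{
    double saving_per_iter = e_max - predicted_balanced_time(e_avg, c_dev);
    double total_saving    = saving_per_iter * (double)iters_left;
    return total_saving > redist_cost + c_threshold;
}

int main(void)
{
    /* Illustrative numbers only. */
    printf("%d\n", should_redistribute(0.9, 0.5, 0.1, 600, 40.0, 15.0));
    return 0;
}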

Experiments

8 processors were used

A loading event consisting of a parallel program was introduced 3 minutes after Jacobi started

The number of tasks of the loading event was varied

Cthreshold – 15 seconds

Results

Malleable Jobs

Parallel Jobs:
- Rigid – only one set of processors
- Moldable – flexible during job start, but cannot be reconfigured during execution
- Malleable – flexible during job start as well as during execution

Rescheduling in GrADS

• Performance-oriented migration framework
• Tightly coupled policies for suspension and migration
• Takes into account load characteristics and remaining execution times
• Migration of an application depends on:
  • The amount of increase or decrease in loads on the system
  • The time of the application execution when load is introduced into the system
  • The performance benefits that can be obtained due to migration

Components:
1. Migrator
2. Contract Monitor
3. Rescheduler

SRS Checkpointing Library

- End application instrumented with a user-level checkpointing library
- Enables reconfiguration of executing applications across distinct domains
- Allows fault tolerance
- Uses IBP (Internet Backplane Protocol) for storage and retrieval of checkpoints
- Needs the Runtime Support System (RSS) – an auxiliary daemon that is started with the parallel application
- Simple API

- SRS_Init()
- SRS_Restart_Value()
- SRS_Register()
- SRS_Check_Stop()
- SRS_Read()
- SRS_Finish()
- SRS_StoreMap(), SRS_DistributeFunc_Create(), SRS_DistributeMap_Create()

SRS Internals

[Diagram: the MPI application, linked with the SRS library, stores checkpoints in IBP depots. The Runtime Support System (RSS) daemon is started alongside the application; the application polls it for STOP signals, and on restart reads the checkpoints back with possible redistribution.]

SRS API

Original code:

/* begin code */
MPI_Init()
/* initialize data */
loop{
}
MPI_Finalize()

SRS instrumented code:

/* begin code */
MPI_Init()
SRS_Init()
restart_value = SRS_Restart_Value()
if(restart_value == 0){
  /* initialize data */
}
else{
  SRS_Read("data", data, BLOCK, NULL)
}
SRS_Register("data", data, SRS_INT, data_size, BLOCK, NULL)
loop{
  stop_value = SRS_Check_Stop()
  if(stop_value == 1){
    exit();
  }
}
SRS_Finish()
MPI_Finalize()

SRS Example – Original Code

/* rank, size, comm, global_size, global_A and local_A are assumed to be
   declared and set up earlier (MPI_Comm_rank, MPI_Comm_size, etc.) */
MPI_Init(&argc, &argv);

local_size = global_size/size;

if(rank == 0){
  for(i=0; i<global_size; i++){
    global_A[i] = i;
  }
}

MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);

iter_start = 0;
for(i=iter_start; i<global_size; i++){
  proc_number = i/local_size;
  local_index = i%local_size;
  if(rank == proc_number){
    local_A[local_index] += 10;
  }
}

MPI_Finalize();

SRS Example – Modified Code

MPI_Init(&argc, &argv);
SRS_Init();

local_size = global_size/size;

restart_value = SRS_Restart_Value();
if(restart_value == 0){
  /* fresh start: initialize and distribute the data */
  if(rank == 0){
    for(i=0; i<global_size; i++){
      global_A[i] = i;
    }
  }
  MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
  iter_start = 0;
}
else{
  /* restart: recover the data and the iteration counter from the checkpoint */
  SRS_Read("A", local_A, BLOCK, NULL);
  SRS_Read("iterator", &iter_start, SAME, NULL);
}

SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);

SRS Example – Modified Code (Contd.)

for(i=iter_start; i<global_size; i++){
  /* poll RSS; if a STOP signal is pending, stop cleanly */
  stop_value = SRS_Check_Stop();
  if(stop_value == 1){
    MPI_Finalize();
    exit(0);
  }
  proc_number = i/local_size;
  local_index = i%local_size;
  if(rank == proc_number){
    local_A[local_index] += 10;
  }
}

SRS_Finish();
MPI_Finalize();

Components (Continued) – Contract Monitor

- Monitors the progress of the end application
- Tolerance limits specified to the contract monitor:
  - Upper contract limit – 2.0
  - Lower contract limit – 0.7
- When it receives the actual execution time for an iteration from the application:
  - Calculates the ratio between actual and predicted time
  - Adds it to the average ratio
  - Adds it to the last_5_avg

Contract Monitor

If average ratio > upper contract limit:
- Contact the rescheduler and request rescheduling
- Receive the reply
- If the reply is "SORRY. CANNOT RESCHEDULE":
  - Calculate new_predicted_time based on last_5_avg and orig_predicted_time
  - Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit
  - Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit
  - prev_predicted_time = new_predicted_time

Contract Monitor

If average ratio < lower contract limit:
- Calculate new_predicted_time based on last_5_avg and orig_predicted_time
- Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit
- Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit
- prev_predicted_time = new_predicted_time
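
The slides describe the contract-monitor loop but not the exact adjustment formulas. The sketch below is a minimal, hedged illustration of that loop: the new predicted time, the rescaling of the limits, the last_5_avg update and the request_rescheduling() helper are all assumptions made for illustration, not the GrADS implementation.

/* Hedged sketch of the contract-monitor logic described above. */
#include <string.h>

typedef struct {
    double upper_limit;          /* e.g. 2.0 */
    double lower_limit;          /* e.g. 0.7 */
    double avg_ratio;            /* running average of actual/predicted */
    double last_5_avg;           /* average ratio over recent iterations */
    double orig_predicted_time;  /* per-iteration time from the performance model */
    double prev_predicted_time;
    long   n_samples;
} contract_t;

/* Stand-in for the real call to the rescheduler. */
static const char *request_rescheduling(void)
{
    return "SORRY. CANNOT RESCHEDULE";   /* placeholder reply */
}

static void rebase_contract(contract_t *c)
{
    /* Assumed adjustment: derive a new predicted time from recent behaviour
     * and rescale both limits so they keep their original meaning. */
    double new_predicted = c->last_5_avg * c->orig_predicted_time;
    double scale = c->prev_predicted_time / new_predicted;
    c->upper_limit *= scale;
    c->lower_limit *= scale;
    c->prev_predicted_time = new_predicted;
}

/* Called whenever the application reports the actual time of an iteration. */
void on_iteration_time(contract_t *c, double actual, double predicted)
{
    double ratio = actual / predicted;
    c->avg_ratio  = (c->avg_ratio * c->n_samples + ratio) / (double)(c->n_samples + 1);
    c->n_samples += 1;
    c->last_5_avg = 0.8 * c->last_5_avg + 0.2 * ratio;   /* crude 5-sample proxy */

    if (c->avg_ratio > c->upper_limit) {
        /* Performance degradation: ask the rescheduler to migrate. */
        if (strcmp(request_rescheduling(), "SORRY. CANNOT RESCHEDULE") == 0)
            rebase_contract(c);          /* cannot migrate: adapt expectations */
    } else if (c->avg_ratio < c->lower_limit) {
        /* Running faster than predicted: tighten the contract instead. */
        rebase_contract(c);
    }
}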

Rescheduler

- A metascheduling service
- Operates in two modes:
  - When the contract monitor requests rescheduling, i.e. during performance degradation
  - Periodically queries the Database Manager for recently completed GrADS applications and migrates executing applications to make use of the freed resources, i.e. opportunistic rescheduling

Rescheduler Pseudo Code

[The rescheduler pseudo code is shown as figures in the original slides.]
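
As a rough, hedged illustration of the decision those figures encode (migrate only when the expected benefit outweighs the rescheduling overhead summarized later in the "Static Rescheduling Cost" table), consider the sketch below; the structure, names and minimum-benefit threshold are assumptions, not the published pseudo code.

/* Hedged sketch of a rescheduler decision: migrate only when the
 * predicted benefit outweighs the cost of checkpointing, rescheduling
 * and restarting. The minimum-benefit threshold is an assumption. */

typedef struct {
    double remaining_current;   /* predicted remaining time on current resources (s)   */
    double remaining_candidate; /* predicted remaining time on candidate resources (s)  */
    double rescheduling_cost;   /* checkpoint + schedule + restart + redistribution (s) */
} reschedule_estimate_t;

int worth_migrating(const reschedule_estimate_t *e, double min_benefit)
{
    double time_if_migrated = e->remaining_candidate + e->rescheduling_cost;
    return (e->remaining_current - time_if_migrated) > min_benefit;
}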

Application and Metascheduler Interactions

[Flowchart: the user supplies the problem parameters and an initial list of machines. The application manager performs resource selection, requests permission from the Permission Service, performs application-specific scheduling and contract development with the Contract Negotiator, and launches the application. If permission or the contract is refused, it gets new resource information and retries, or aborts. After launching, it exits on application completion; if the application was stopped, it waits for a restart signal and repeats the cycle with new resource information, the problem parameters and the final schedule.]

Rescheduler Architecture

[Diagram: the Application Manager launches the application and either exits on application completion or, if the application was stopped, waits for a restart signal and gets new resource information. The application reports execution times to the Contract Monitor and queries the Runtime Support System (RSS) for STOP signals. The Contract Monitor sends migration requests to the Rescheduler, which stores STOP and RESUME state in the Database Manager and sends the STOP signal to the application through RSS.]

Static Rescheduling Cost

Rescheduling Phase                            Time (seconds)
Writing checkpoints                                40
Waiting for NWS update                             90
NWS retrieval time                                120
Application-level scheduling                       80
Other Grid overhead                                10
Starting application                               60
Reading checkpoints and data redistribution       500
Total                                             900

Experiments and Results – Rescheduling on Request

- Different problem sizes of ScaLAPACK QR
- msc – fast machines; opus – slow machines
- The initial set of resources consisted of 4 msc and 8 opus machines
- The performance model always chose the 4 msc machines for the application run
- 5 minutes into the application run, an artificial load was introduced on the 4 msc machines
- The application migrated from UT to UIUC

[Chart: execution times with and without rescheduling for different ScaLAPACK QR problem sizes.]

The rescheduler decided not to reschedule for size 8000 – a wrong decision!

Rescheduling Depending on Amount of Load

- ScaLAPACK QR problem size – 12000
- Load introduced 20 minutes after application start
- The amount of load was varied

[Chart: execution times with and without rescheduling for different amounts of load.]

The rescheduler decided not to reschedule – a wrong decision!

Rescheduling Depending on Load Introduction Time

- ScaLAPACK QR problem size – 12000
- The same load introduced at different points of application execution

[Chart: execution times with and without rescheduling for different load introduction times.]

The rescheduler decided not to reschedule – a wrong decision!

Experiments and Results – Opportunistic Rescheduling

- Two problems:
  - 1st problem, size 14000, executing on 6 msc machines
  - 2nd problem of varying sizes
- The 2nd problem was introduced 2 minutes after the start of the 1st problem
- The initial set of resources for the 2nd problem consisted of 6 msc machines and 2 opus machines
- Due to the presence of the 1st problem, the 2nd problem had to use both the msc and opus machines, and hence involved Internet bandwidth
- After the 1st problem completes, the 2nd problem can be rescheduled to use only the msc machines

[Charts: execution times of the large problem and the second problem, with and without opportunistic rescheduling, for different sizes of the second problem.]

Dynamic Prediction of Rescheduling Cost

- While making a rescheduling decision, the rescheduler contacts RSS and obtains the current data distribution
- Forms old and new data maps
- Based on the maps and current NWS information, predicts the redistribution cost
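
The slides do not show the prediction formula. As an illustration only, the sketch below assumes the cost is estimated by summing, over each source/destination pair implied by the old and new data maps, the bytes to be moved divided by the NWS-forecast bandwidth, plus per-transfer latency; the data-map representation and the nws_forecast() helper are assumptions, not the GrADS implementation.

/* Hedged sketch of predicting redistribution cost from old/new data maps
 * and NWS forecasts, summed serially as a pessimistic estimate. */
#include <stddef.h>

typedef struct {
    int    src_proc;       /* processor holding the block in the old map */
    int    dst_proc;       /* processor that needs it in the new map     */
    size_t bytes;          /* amount of data to move                     */
} transfer_t;

typedef struct {
    double bandwidth_bps;  /* NWS-forecast bandwidth src -> dst (bytes/s) */
    double latency_s;      /* NWS-forecast latency (seconds)              */
} link_forecast_t;

/* Placeholder for the NWS lookup; fixed numbers stand in for real forecasts. */
static link_forecast_t nws_forecast(int src_proc, int dst_proc)
{
    (void)src_proc; (void)dst_proc;
    link_forecast_t f = { 10.0e6, 0.05 };
    return f;
}

double predict_redistribution_cost(const transfer_t *xfers, size_t n)
{
    double cost = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (xfers[i].src_proc == xfers[i].dst_proc)
            continue;                    /* data already in place */
        link_forecast_t f = nws_forecast(xfers[i].src_proc, xfers[i].dst_proc);
        cost += f.latency_s + (double)xfers[i].bytes / f.bandwidth_bps;
    }
    return cost;
}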

Dynamic Prediction of Rescheduling Cost

Application started on: 4 mscs

Application restarted on: 8 opus

References / Sources / Credits

- Gary Shao, Rich Wolski and Fran Berman. "Predicting the Cost of Redistribution in Scheduling". Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing.
- Vadhiyar, S. and Dongarra, J. "Performance Oriented Migration Framework for the Grid". Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), pp. 130-137, May 2003, Tokyo, Japan.
- L. V. Kale, Sameer Kumar and J. DeSouza. "A Malleable-Job System for Timeshared Parallel Machines". 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 21-24, 2002, Berlin, Germany.
- See the Cactus migration thorn.
- See opportunistic migration by Huedo.


GridWay

Migration:
- When performance degradation happens
- When "better" resources are discovered
- When requirements change
- Owner decision
- Remote resource failure

- Rescheduling done at the discovery interval
- Performance degradation evaluator program executed at the monitoring interval

Components:
- Request manager
- Dispatch manager
- Submission manager – prologing, submitting, canceling, epiloging
- Performance monitor

Application-specific components:
- Resource selector
- Performance degradation evaluator
- Prolog
- Wrapper
- Epilog

Opportunistic Job Migration

Factors (a sketch of how they might be combined follows below):
- Performance of the new host
- Remaining execution time of the application
- Proximity of the new resource to the needed data
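
The slides only list these factors. Purely as an illustration of one way to combine them, the sketch below assumes a simple benefit estimate: scale the remaining execution time by the relative speed of the candidate host and subtract a data-movement cost derived from the data's proximity; every name and the weighting are assumptions, not GridWay's policy.

/* Hedged sketch of combining the three opportunistic-migration factors. */

typedef struct {
    double remaining_time_s;    /* remaining execution time on the current host */
    double cur_host_speed;      /* relative performance of the current host     */
    double new_host_speed;      /* relative performance of the candidate host   */
    double data_bytes;          /* data that must follow the job                */
    double bandwidth_bps;       /* bandwidth from the data to the new resource  */
} migration_factors_t;

/* Positive return value: estimated seconds saved by migrating. */
double estimated_migration_gain(const migration_factors_t *f)
{
    double time_on_new_host = f->remaining_time_s * f->cur_host_speed / f->new_host_speed;
    double data_move_cost   = f->data_bytes / f->bandwidth_bps;
    return f->remaining_time_s - (time_on_new_host + data_move_cost);
}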

Dynamic Space Sharing on Clusters of Non-Dedicated Workstations (Chowdhury et al.)

Dynamic reconfiguration – an application-level approach for dynamic reconfiguration of grid-based iterative applications

SRS Overhead

- Worst-case overhead – 15%
- Worst-case SRS overhead of all results – 36%

SRS Data Redistribution Cost

Started on – 8 MSCs
Restarted on – 8 OPUS, 2 MSCs

Modified GrADS Architecture

[Diagram: the user invokes the Grid Routine / Application Manager, which drives the Resource Selector, Performance Modeler, Contract Developer and Application Launcher. The Resource Selector consults MDS and NWS; permission is obtained from the Permission Service and the contract from the Contract Negotiator. The launched application runs with the Runtime Support System (RSS) and reports to the Contract Monitor; the Rescheduler and Database Manager complete the picture.]

Another approach: AMPI

- AMPI – an MPI implementation on top of Charm++
- Processes are implemented as user-level threads
- Charm++ provides a load balancing framework and migrates threads
- The load balancing framework accepts a processor map
- The parallel job is started on all processors in the system
- Work is allocated only to processors in the processor map, i.e. threads/objects are assigned to processors in the processor map

Rescheduling

When the processor map changes:
- Threads are migrated to the new set of processors in the processor map
- Skeleton processes are left behind on the vacated processors
- A skeleton forwards messages to the threads/objects previously housed on that processor
- The new processor map is conveyed to the load balancing framework by the adaptive job scheduler

Overhead

Shrink or expand time depends on:
- The per-process data that has to be transferred
- The number of processors involved

Cost of the skeleton process

CPU Utilization by 2 Jobs

Adaptive Job Scheduler

- A variant of the dynamic equipartitioning strategy (sketched below)
- Each job specifies the minimum and maximum number of processors it can run on
- The scheduler recalculates the number of processors assigned to each running job
- Running jobs and the new job are first assigned their minimum requirement
- The leftover processors are divided equally among all the jobs
- The new job is assigned to a queue if it cannot be allocated its minimum requirement
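
A minimal sketch of this equipartitioning rule, under the assumptions that leftover processors are handed out one at a time round-robin and that a job whose minimum cannot be met is simply marked as queued:

/* Hedged sketch of the dynamic equipartitioning variant described above:
 * every job first receives its minimum, the leftover processors are
 * divided equally, and no job exceeds its maximum. */
#include <stddef.h>

typedef struct {
    int minpe;      /* minimum processors the job can run on */
    int maxpe;      /* maximum processors the job can use    */
    int assigned;   /* output: processors allocated          */
    int queued;     /* output: 1 if the job must wait        */
} job_t;

/* Returns the number of processors actually handed out. */
int equipartition(job_t *jobs, size_t njobs, int total_procs)
{
    int used = 0;

    /* Pass 1: give every job its minimum requirement, or queue it. */
    for (size_t i = 0; i < njobs; i++) {
        if (used + jobs[i].minpe <= total_procs) {
            jobs[i].assigned = jobs[i].minpe;
            jobs[i].queued   = 0;
            used += jobs[i].minpe;
        } else {
            jobs[i].assigned = 0;
            jobs[i].queued   = 1;
        }
    }

    /* Pass 2: divide the leftover processors equally, respecting maxpe. */
    int leftover = total_procs - used;
    while (leftover > 0) {
        int gave_any = 0;
        for (size_t i = 0; i < njobs && leftover > 0; i++) {
            if (!jobs[i].queued && jobs[i].assigned < jobs[i].maxpe) {
                jobs[i].assigned++;
                leftover--;
                used++;
                gave_any = 1;
            }
        }
        if (!gave_any)
            break;      /* every running job is at its maximum */
    }
    return used;
}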

Scheduling

- The same strategy is followed when jobs complete
- The scheduler conveys its decision to the jobs as a bit vector
- Jobs then perform thread migration

Experiments

- 32-processor Linux cluster
- Job arrivals follow a Poisson process
- Each job is a molecular dynamics (MD) program with 50,000 atoms and a different number of iterations
- The number of iterations is exponentially distributed
- Minimum number of processors, minpe – uniformly distributed between 1 and 64
- maxpe – 64
- Each experiment – 50 job arrivals

Results

Load factor – mean arrival rate x (execution time on 64 processors)

Dynamic reconfiguration

- Ability to change the number of processors during execution
- Condor-like environment:
  - Respect ownership of workstations
  - Provide high performance for parallel applications
- Dynamic reconfiguration also provides high throughput for the system