
Page 1: Scheduling for Grid Computing

Scheduling for Grid Computing

龚斌 (Gong Bin), School of Computer Science and Technology, Shandong University

Shandong Provincial High Performance Computing Center

Page 2: Scheduling for Grid Computing

Reference

• Fangpeng Dong and Selim G. Akl: Scheduling Algorithms for Grid Computing: State of the Art and Open Problems

• Yanmin Zhu: A Survey on Grid Scheduling Systems

• Peter Gradwell: Overview of Grid Scheduling Systems

• Alain Andrieux et al.: Open Issues in Grid Scheduling

• Jia Yu and Rajkumar Buyya: A Taxonomy of Workflow Systems for Grid Computing

Page 3: Scheduling for Grid Computing

What is the Grid? The Grid is a set of emerging technologies built on top of the Internet that integrates high-speed networks, high-performance computers, large databases, sensors, remote instruments, and other resources, making more resources available to both scientists and the general public. The Internet mainly provides communication functions such as e-mail and web browsing, whereas the Grid offers far richer and stronger capabilities, letting people transparently use computing, storage, information-processing, and other resources.

1998, The Grid: Blueprint for a New Computing Infrastructure.

Ian Foster: senior scientist at Argonne National Laboratory, leader of the U.S. computational Grid project

Page 4: Scheduling for Grid Computing

The Definition of Grid

• A type of parallel and distributed system that enables the sharing, selection, and aggregation of geographically distributed autonomous and heterogeneous resources dynamically at runtime depending on their availability, capability, performance, cost and users’ quality-of-service requirements

Page 5: Scheduling for Grid Computing

Characteristics of Grid Computing

• Exploiting underutilized resources

• Distributed supercomputing capability

• Virtual organization for collaboration

• Resource balancing

• Reliability

Page 6: Scheduling for Grid Computing

Classes of Grid Computing

• By function:
– Computing Grid
– Data Grid
– Service Grid

• By size:
– IntraGrid
– ExtraGrid
– InterGrid

Page 7: Scheduling for Grid Computing

Traditional Parallel Scheduling Systems

• Systems:
– SMP: symmetric multiprocessing, shared memory
– Cluster
– CC-NUMA: e.g., SGI

• Scheduling systems:
– OpenPBS, LSF, SGE, LoadLeveler, Condor, etc.

Page 8: Scheduling for Grid Computing
Page 9: Scheduling for Grid Computing

Cluster Scheduling

Page 10: Scheduling for Grid Computing

The Assumptions Underlying Traditional Systems

• All resources reside within a single administrative domain.

• To provide a single system image, the scheduler controls all of the resources.

• The resource pool is invariant.

• Contention caused by incoming applications can be managed by the scheduler according to some policies, so that its impact on the performance the site can provide to each application can be well predicted.

• Computations and their data reside in the same site, or data staging is a highly predictable process, usually from a predetermined source to a predetermined destination, which can be viewed as constant overhead.

Page 11: Scheduling for Grid Computing

Characteristics of Cluster Scheduling

• Homogeneity of resource and application

• Dedicated resource

• Centralized scheduling architecture

• High-speed interconnection network

• Monotonic performance goal

Page 12: Scheduling for Grid Computing

Timeline

Page 13: Scheduling for Grid Computing

The Terms of Grid Scheduling

• A task is an atomic unit to be scheduled by the scheduler and assigned to a resource.

• The properties of a task are parameters like CPU/memory requirement, deadline, priority, etc.

• A job (or metatask, or application) is a set of atomic tasks that will be carried out on a set of resources. Jobs can have a recursive structure, meaning that jobs are composed of sub-jobs and/or tasks, and sub-jobs can themselves be decomposed further into atomic tasks.

• A resource is something that is required to carry out an operation, for example: a processor for data processing, a data storage device, or a network link for data transporting.

• A site (or node) is an autonomous entity composed of one or multiple resources.

• A task scheduling is the mapping of tasks to a selected group of resources, which may be distributed across multiple administrative domains.

Page 14: Scheduling for Grid Computing

Three Stages of Scheduling Process

• Resource discovering and filtering

• Resource selecting and scheduling according to certain objectives

• Job submission

Page 15: Scheduling for Grid Computing

Stages of SuperScheduling

• Resource Discovery
– Authorization filtering
– Application requirement definition
– Minimal requirement filtering

• System Selection
– Gathering information (query)
– Select the system(s) to run on

• Run Job
– (optional) Make an advance reservation
– Submit job to resources
– Preparation tasks
– Monitor progress (maybe go back to System Selection)
– Find out the job is done
– Completion tasks

Page 16: Scheduling for Grid Computing

Grid Scheduling framework

• Application Model
– Extracts the characteristics of the applications to be scheduled.

• Resource Model
– Describes the characteristics of the underlying resources in Grid systems.

• Performance Model
– Models the behavior of a specific job on a specific computational resource.

• Scheduling Policy
– Responsible for deciding how applications should be executed and how resources should be utilized.

Page 17: Scheduling for Grid Computing

Applications Classification

• Batch vs. Interactive

• Real-time vs. Non real-time

• Priority

Page 18: Scheduling for Grid Computing

Resources Classification

• Time-shared vs. Non time-shared

• Dedicated vs. Non-dedicated

• Preemptive vs. Non-preemptive

Page 19: Scheduling for Grid Computing

Performance Estimation

• Simulation

• Analytical Modeling

• Historical Data

• On-line Learning

• Hybrid

Page 20: Scheduling for Grid Computing

Scheduling Policy

• Application-centric (a small code sketch of these metrics follows below)
– Execution Time: the time spent executing the job
– Wait Time: the time spent waiting in the ready queue
– Speedup: the ratio of the time spent executing the job on the original platform to the time spent executing it on the Grid
– Turnaround Time: also called response time; defined as the sum of the waiting time and the execution time
– Job Slowdown: defined as the ratio of the response time of a job to its actual run time

• System-centric
– Throughput: the number of jobs completed in one unit of time, such as per hour or per day
– Utilization: the percentage of time a resource is busy
– Flow Time: the flow time of a set of jobs is the sum of the completion times of all jobs
– Average application performance
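A minimal Python sketch of these metrics (illustrative only, not part of the original slides; the submit/start/finish fields of the hypothetical JobRecord are assumed inputs):

from dataclasses import dataclass

@dataclass
class JobRecord:
    submit: float   # time the job entered the ready queue (hypothetical field)
    start: float    # time execution began
    finish: float   # time execution ended

def turnaround(j: JobRecord) -> float:
    # turnaround (response) time = wait time + execution time
    return (j.start - j.submit) + (j.finish - j.start)

def slowdown(j: JobRecord) -> float:
    # job slowdown = response time / actual run time
    return turnaround(j) / (j.finish - j.start)

def throughput(jobs: list, period: float) -> float:
    # system-centric: number of jobs completed per unit of time
    return len(jobs) / period

def utilization(busy_time: float, period: float) -> float:
    # percentage of time a resource is busy
    return 100.0 * busy_time / period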

Page 21: Scheduling for Grid Computing

Scheduling Strategy

• Performance-driven
• Market-driven
• Trust-driven
– Security policy
– Accumulated reputation
– Self-defense capability
– Attack history
– Site vulnerability

Page 22: Scheduling for Grid Computing

A logical Grid scheduling architecture
Broken lines: resource or application information flows
Solid lines: task or task-scheduling command flows

Page 23: Scheduling for Grid Computing

Grid Scheduler

• The Grid Scheduler (GS) receives applications from Grid users, selects feasible resources for these applications according to information acquired from the Grid Information Service module, and finally generates application-to-resource mappings based on certain objective functions and predicted resource performance.
– A GS usually cannot control Grid resources directly, but works like a broker or agent
– Also called a metascheduler or SuperScheduler
– It is not an indispensable component of the Grid infrastructure and is not included in the Globus Toolkit
– In reality, multiple such schedulers might be deployed and organized into different structures (centralized, hierarchical, and decentralized) according to different concerns, such as performance or scalability.

Page 24: Scheduling for Grid Computing

Grid Information Service (GIS)

• The GIS is responsible for collecting and predicting resource state information, such as CPU capacities, memory size, network bandwidth, software availability, and the load of a site in a particular period, and for providing this information to Grid schedulers.

• The GIS can answer queries for resource information or push information to subscribers.

• Globus: Monitoring and Discovery System (MDS)

• Application profiling (AP) is used to extract the properties of applications.

• Analogical benchmarking (AB) provides a measure of how well a resource can perform a given type of job.

Page 25: Scheduling for Grid Computing

Launching and Monitoring (LM)

• Also called the binder

• Implements a finally determined schedule by submitting applications to the selected resources, staging input data and executables if necessary, and monitoring the execution of the applications.

• Globus: Grid Resource Allocation and Management (GRAM)

Page 26: Scheduling for Grid Computing

Local Resource Manager (LRM)

• Is mainly responsible for two jobs: local scheduling inside a resource domain, where not only jobs from exterior Grid users but also jobs from the domain's local users are executed, and reporting resource information to the GIS.

• OpenPBS, Condor, LSF, SGE, etc.

• NWS (Network Weather Service), Hawkeye, Ganglia

Page 27: Scheduling for Grid Computing

Evaluation Criteria for Grid Scheduling Systems

• Application Performance Promotion

• System Performance Promotion

• Scheduling Efficiency

• Reliability

• Scalability

• Applicability to Application and Grid Environment

Page 28: Scheduling for Grid Computing

Scheduler Organization

• Centralized

• Decentralized

• Hierarchical

Page 29: Scheduling for Grid Computing

Centralized Scheduling

Page 30: Scheduling for Grid Computing

Decentralized Scheduling

Page 31: Scheduling for Grid Computing

Hierarchical Scheduling

Page 32: Scheduling for Grid Computing

Existing Grid Scheduling Systems

• Information collection systems
– MDS (Monitoring and Discovery System)
– NWS (Network Weather Service)

• Condor
• Condor-G
• AppLeS
• Nimrod-G
• GrADS
• etc.

Page 33: Scheduling for Grid Computing

Characteristics of scheduling for Grid Computing

• Heterogeneity and Autonomy
– The Grid scheduler does not have full control of the resources
– It is hard to estimate the exact cost of executing a task on different sites
– The scheduler is required to adapt to different local policies

• Performance Dynamism
– Grid resources are not dedicated to a Grid application
– Performance fluctuates, compared with traditional systems
– Some countermeasures: QoS negotiation, resource reservation, rescheduling

• Resource Selection and Computation-Data Separation
– In traditional systems, the executable code of an application and its input/output data are usually at the same site, or the input sources and output destinations are determined before the application is submitted, so the cost of data staging can be neglected.

• Application Diversity

Page 34: Scheduling for Grid Computing

Grid Scheduling Algorithms

• The general scheduling problem is NP-complete.

• The scheduling problem becomes more challenging because of some unique characteristics belonging to Grid computing.

Page 35: Scheduling for Grid Computing

A hierarchical taxonomy of scheduling algorithms

Page 36: Scheduling for Grid Computing

A Taxonomy of Grid Scheduling Algorithms

• Local vs. Global
– Grid scheduling falls into the global category.

• Static vs. Dynamic
– Both static and dynamic scheduling are widely adopted in Grid computing.

Page 37: Scheduling for Grid Computing

Static Scheduling

• Every task comprising the job is assigned once to a resource, the placement of an application is static, and a firm estimate of the cost of the computation can be made in advance of the actual execution.

• Easier to program from a scheduler's point of view.

• Rescheduling mechanisms are introduced to allow task migration.

• Another side effect is that the gap between static scheduling and dynamic scheduling becomes less important.

Page 38: Scheduling for Grid Computing

Dynamic Scheduling

• Online scheduling

• Two major components: system state estimation and decision making.

• Advantage: the system need not be aware of the run-time behavior of the application before execution.

• The primary performance goal: maximizing resource utilization, rather than minimizing the runtime of individual jobs.

• Four basic approaches:
– Unconstrained FIFO
– Balance-constrained techniques
– Cost-constrained techniques
– Hybrids of static and dynamic techniques

Page 39: Scheduling for Grid Computing

Unconstrained FIFO

• The resource with the currently shortest waiting queue or the smallest waiting-queue time is selected for the incoming task.

• Also known as Opportunistic Load Balancing (OLB), or the myopic algorithm.

• Simple, but far from optimal.
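A minimal illustrative sketch of the OLB idea (assumed, not from the slides; queue_time is a hypothetical per-resource estimate of how long its waiting queue needs to drain):

def olb_assign(task, resources, queue_time, task_cost=1.0):
    # Unconstrained FIFO / OLB: send the incoming task to the resource whose
    # waiting queue is expected to drain soonest, ignoring how fast that
    # resource would actually execute the task.
    target = min(resources, key=lambda r: queue_time[r])
    queue_time[target] += task_cost   # rough bookkeeping of the growing queue
    return target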

Page 40: Scheduling for Grid Computing

Balance-constrained

• Attempts to rebalance the loads on all resources by periodically shifting waiting tasks from one waiting queue to another.

• The rebalance only happens inside a “neighborhood” where all resources are better interconnected.

• Advantages:
– The initial load can be quickly distributed to all resources and execution can start quickly
– The rebalancing process is distributed and scalable
– The communication delay of rebalancing can be reduced, since task shifting only happens among resources that are "close" to each other

Page 41: Scheduling for Grid Computing

Cost-constrained

• Not only considers the balance among resources but also the communication cost between tasks

• Instead of doing a task exchange periodically, tasks will be checked before their move.

• This approach is more efficient than the previous one when the communication costs among resources are heterogeneous and communication cost to execute the application is the main consideration

• It is also flexible, and can be used with other cost factors, such as seeking the lowest memory usage or the lowest disk activity, and so on.

Page 42: Scheduling for Grid Computing

Hybrid

• A further improvement is static-dynamic hybrid scheduling.

• It takes advantage of static scheduling while at the same time capturing the uncertain behaviors of applications and resources.

• For example, where some tasks have special QoS requirements, the static phase can be used to map the tasks with QoS requirements, and dynamic scheduling can be used for the remaining tasks.

Page 43: Scheduling for Grid Computing

A Taxonomy of Grid Scheduling Algorithms (cont.)

• Optimal vs. Suboptimal
– Some criteria: minimum makespan, maximum resource utilization

– Makespan: the time spent from the beginning of the first task in a job to the end of the last task of the job.

– The NP-Complete nature of scheduling algorithms

– Current research tries to find suboptimal solutions

Page 44: Scheduling for Grid Computing

A Taxonomy of Grid Scheduling Algorithms (cont.)

• Approximate vs. Heuristic

– Approximate
• Uses formal computational models, but instead of searching the entire solution space for an optimal solution, it is satisfied once a sufficiently good solution is found
• Factors that determine whether this approach is worthwhile:
– Availability of a function to evaluate a solution
– The time required to evaluate a solution
– The ability to judge the value of an optimal solution according to some metric
– Availability of a mechanism for intelligently pruning the solution space

– Heuristic
• Represents the class of algorithms that make the most realistic assumptions about a priori knowledge concerning process and system loading characteristics
• More adaptive to Grid scenarios, where both resources and applications are highly diverse and dynamic

Page 45: Scheduling for Grid Computing

A Taxonomy of Grid Scheduling Algorithms (cont.)

• Distributed vs. Centralized
– The centralized strategy has the advantage of ease of implementation, but suffers from a lack of scalability, poor fault tolerance, and the possibility of becoming a performance bottleneck.

Page 46: Scheduling for Grid Computing

A Taxonomy of Grid Scheduling Algorithms (cont.)

• Cooperative vs. Non-cooperative
– In the non-cooperative case, individual schedulers act alone as autonomous entities and arrive at decisions regarding their own optimal objectives independently of the effects of those decisions on the rest of the system.

– In the cooperative case, each Grid scheduler has the responsibility to carry out its own portion of the scheduling task, but all schedulers work toward a common system-wide goal.

Page 47: Scheduling for Grid Computing

Objective Functions

• The two major parties in Grid computing
– Resource consumers, who submit various applications
– Resource providers, who share their resources

• Application-centric
– Makespan
– Economic cost

• Resource-centric
– Resource utilization
– Economic profit

Page 48: Scheduling for Grid Computing

Application-Centric

• Aim to optimize the performance of each individual application, as application-level schedulers do.

• Time: makespan

• Grid economic model: economic cost

• QoS: quality of service

Page 49: Scheduling for Grid Computing

Resource-Centric

• Aim to optimize the performance of the resources

• Throughput: the ability of a resource to process a certain number of jobs in a given period

• Utilization: the percentage of time a resource is busy

• Grid economic model: economic profit

• TPCC (Total Processor Cycle Consumption): the total number of instructions the Grid could compute from the start of executing the schedule to its completion
– Represents the total computing power consumed by an application
– Advantage: it is little affected by the variance in resource performance, yet is still related to the makespan

Page 50: Scheduling for Grid Computing

Adaptive Scheduling

• The demand for scheduling adaptation comes from:
– The heterogeneity of candidate resources
– The dynamism of resource performance
– The diversity of applications

• Resource Adaptation
• Dynamic Performance Adaptation
• Application Adaptation

Page 51: Scheduling for Grid Computing

Resource Adaptation

• Su et al.: show how the selection of a data storage site affects the network transmission delay.

• Dail et al.: proposed a resource selection algorithm
– Available resources are first grouped into disjoint subsets according to the network delays between the subsets
– Inside each subset, resources are ranked according to their memory size and computational power
– An appropriately sized resource group is selected from the sorted lists

• Subhlok et al.: present algorithms that jointly analyze computational and communication resources for different application demands, and a framework for automatic node selection
– The algorithms are adaptive to demands like selecting a set of nodes that maximizes the minimum available bandwidth between any pair of nodes, or selecting a set of nodes that maximizes the minimum available fractional compute and communication capacities.

Page 52: Scheduling for Grid Computing

Dynamic Performance Adaptation

• Adaptation to the dynamic performance of resources is achieved by:
– Changing scheduling policies or rescheduling
– Distributing workload according to application-specific performance models
– Finding a proper number of resources to be used

• Usually adopts some kind of divide-and-conquer approach
– Parameter-sweep applications
– Data stripe processing

• Cluster-aware Random Stealing (CRS)
– Allows an idle resource to steal jobs not only from the local cluster but also from remote ones, with a very limited amount of wide-area communication

Page 53: Scheduling for Grid Computing

Application Adaptation

• Dail et al.: explicitly decouple the scheduler core from the application-specific and platform-specific components used by the core.

• Aggarwal et al.: resource reservation

• Wu et al.: give a very good example of how a self-adaptive scheduling algorithm cooperates with long-term resource performance prediction.
– The algorithm is adaptive to indivisible single sequential jobs, jobs that can be partitioned into independent parallel tasks, and jobs that have a set of indivisible tasks.

– When the prediction error of the system utilization reaches a threshold, the scheduler tries to reallocate tasks.

Page 54: Scheduling for Grid Computing

Task Dependency of an Application

• Independent
– Static
– Dynamic

• Dependent
– Static
• List algorithms
• Clustering algorithms
• Duplication-based algorithms
– Dynamic
– Static enhanced by dynamic rescheduling

Page 55: Scheduling for Grid Computing

Independent Task Scheduling

• Algorithms with performance estimates
– MET
– MCT
– Min-min
– Max-min
– XSufferage
– Task grouping

• Algorithms without performance estimates

Page 56: Scheduling for Grid Computing

MET Algorithm

• Minimum Execution Time

• Assigns each task to the resource with the best expected execution time for that task, regardless of whether that resource is available at the present time.

• The motivation behind MET is to give each task its best machine.

• This can cause severe load imbalance across machines.

Page 57: Scheduling for Grid Computing

MET Algorithm

for each arrived task S[k]
    for each host H[j] in heterogeneous machine set H
        find the minimum E[k,j] and the machine H[t] that obtains it
    endfor
    update the machine ready time: r[t] = r[t] + E[k,t]
endfor

S[k]: the task set
H[j]: the machine set
E[k,j]: the expected execution time of task S[k] on machine H[j]
r[t]: the expected ready time of machine H[t]
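A runnable Python version of the same MET procedure, for illustration only (exec_time stands in for the E[k,j] table above):

def met_schedule(tasks, hosts, exec_time):
    # Minimum Execution Time: each task goes to the host with the smallest
    # expected execution time, regardless of that host's current load.
    ready = {h: 0.0 for h in hosts}              # r[t], expected ready time per host
    mapping = {}
    for task in tasks:                           # tasks handled in arrival order
        best = min(hosts, key=lambda h: exec_time[task][h])
        mapping[task] = best
        ready[best] += exec_time[task][best]     # r[t] = r[t] + E[k,t]
    return mapping, ready

# Example with a tiny hypothetical E[k,j] table:
# exec_time = {"t1": {"h1": 4.0, "h2": 9.0}, "t2": {"h1": 3.0, "h2": 8.0}}
# met_schedule(["t1", "t2"], ["h1", "h2"], exec_time)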

Page 58: Scheduling for Grid Computing

MCT Algorithm

• Minimum Completion Time

• Assigns each task, in an arbitrary order, to the resource with the minimum expected completion time for that task.

• This causes some tasks to be assigned to machines that do not have the minimum execution time for them.

• The intuition behind MCT is to combine the benefits of Opportunistic Load Balancing (OLB) and MET, while avoiding the circumstances in which OLB and MET perform poorly.

Page 59: Scheduling for Grid Computing

MCT Algorithm

for each arrived task S[k]
    for each host H[j] in heterogeneous machine set H
        compute the predicted completion time: C[k,j] = E[k,j] + r[j]
        find the minimum C[k,j] and the machine H[t] that obtains it
    endfor
    update the machine ready time: r[t] = r[t] + E[k,t]
endfor

C[k,j]: the expected completion time of task S[k] on machine H[j]

Page 60: Scheduling for Grid Computing

Min-min Algorithm

• Algorithm:
– Begins with the set U of all unmapped tasks
– The set M of minimum completion times, one for each task in U, is found
– The task with the overall minimum completion time in M is selected and assigned to the corresponding machine
– The newly mapped task is removed from U, and the process repeats until all tasks are mapped (i.e., U is empty)

• Based on the minimum completion time, as is MCT.

Page 61: Scheduling for Grid Computing

Min-min Algorithm

for all tasks S[k] in scheduling set SS
    for all machines H[j] in heterogeneous host set H
        C[k,j] = E[k,j] + r[j]    // expected completion time of every task on every machine
do until all tasks in SS are mapped
    for each task in SS
        // find the minimum completion time of every unmapped task and the host that obtains it
        find the earliest (minimum) completion time and the host that obtains it
    endfor
    // among the minimum completion times of all unmapped tasks, find the overall minimum and the host H[j] that obtains it
    find the task S[k] with the minimum earliest completion time
    assign task S[k] to the host H[j] that gives the earliest completion time
    delete task S[k] from SS and update r[j]
    update C[k,j] for all hosts H[j]
enddo
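An illustrative Python sketch of Min-min in the same notation (exec_time stands in for E[k,j] and is an assumed input):

def min_min(tasks, hosts, exec_time):
    # Min-min: repeatedly pick the unmapped task whose earliest (best)
    # completion time is smallest overall, and map it to that host.
    ready = {h: 0.0 for h in hosts}
    mapping = {}
    unmapped = set(tasks)
    while unmapped:
        # earliest completion time and best host for every unmapped task
        best = {t: min((exec_time[t][h] + ready[h], h) for h in hosts)
                for t in unmapped}
        task = min(unmapped, key=lambda t: best[t][0])   # overall minimum
        finish, host = best[task]
        mapping[task] = host
        ready[host] = finish          # the chosen host is busy until then
        unmapped.remove(task)
    return mapping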

Page 62: Scheduling for Grid Computing

Max-min Algorithm

• The algorithm is very similar to Min-min:
– Begins with the set U of all unmapped tasks

– The set M of minimum completion times is found

– The task with the overall maximum completion time in M is selected and assigned to the corresponding machine

– The newly mapped task is removed from U, and the process repeats until all tasks are mapped
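In code, Max-min only changes the task-selection step of the Min-min sketch above; a short illustrative version under the same assumptions:

def max_min(tasks, hosts, exec_time):
    # Max-min: among the per-task earliest completion times, pick the task
    # with the LARGEST one first, so long tasks are placed early and short
    # tasks fill in around them.
    ready = {h: 0.0 for h in hosts}
    mapping = {}
    unmapped = set(tasks)
    while unmapped:
        best = {t: min((exec_time[t][h] + ready[h], h) for h in hosts)
                for t in unmapped}
        task = max(unmapped, key=lambda t: best[t][0])   # the only change vs. Min-min
        finish, host = best[task]
        mapping[task] = host
        ready[host] = finish
        unmapped.remove(task)
    return mapping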

Page 63: Scheduling for Grid Computing

• The Min-min and Max-min algorithms are simple and can easily be amended to adapt to different scenarios
– X. He et al.: present a QoS Guided Min-min heuristic that can guarantee the QoS requirements of particular tasks and minimize the makespan at the same time.

– Wu, Shu and Zhang: gave a Segmented Min-min algorithm.

Page 64: Scheduling for Grid Computing

Max-Int Algorithm (maximum time-interval algorithm)

for all tasks S[k] in scheduling set SS
    for all machines H[j] in heterogeneous host set H
        C[k,j] = E[k,j] + r[j]    // expected completion time of every task on every machine
do until all tasks in SS are mapped
    for each task S[k] in SS
        find the earliest (minimum) completion time C[k,m] and the host H[m] that obtains it
        find the second-earliest completion time C[k,n] and the host H[n] that obtains it
        compute the interval I[k] = C[k,n] - C[k,m] and store I[k] as an element of vector I
    endfor
    // among the intervals of all unmapped tasks, find the task S[t] with the maximum interval
    find the task S[t] with the maximum interval
    assign task S[t] to the host H[m] that gives its earliest completion time
    delete task S[t] from SS and update r[m]
    update C[k,m] for all remaining tasks in SS
enddo

Page 65: Scheduling for Grid Computing

Max-Int Algorithm

• Draws on the strengths of the Min-min and Max-min algorithms: besides historical scheduling information, it also uses prediction information to reduce the task scheduling time.

• Future scheduling decisions thus always tend toward the optimum.

Page 66: Scheduling for Grid Computing

Sufferage Algorithm

• A resource is assigned to the job that would suffer the greatest loss if it were not assigned to that node.

• Each job has a sufferage value, defined as the difference between its best completion time and its second-best completion time; jobs with higher sufferage values have priority.

Page 67: Scheduling for Grid Computing

Algorithm

for all jobs t[k] in job set T
    for all Grid nodes m[j]
        c[k,j] = e[k,j] + r[j]
do until all jobs in T are mapped
    for all jobs t[k] in job set T
        find the node m[j] with the earliest completion time
        sufferage value = second-best completion time - best completion time
        if m[j] has not been assigned in this round
            assign t[k] to m[j], delete t[k] from T, and mark m[j] as assigned
        else if the sufferage value of the job t[i] already assigned to m[j] is less than the sufferage value of t[k]
            unassign t[i], put t[i] back into T, assign t[k] to m[j], and delete t[k] from T
    endfor
    update vector r based on the jobs assigned to the machines, and update the c matrix
enddo
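An illustrative Python sketch of the Sufferage mechanic (assumed exec_time table as in the earlier sketches; one assignment round per outer iteration):

def sufferage(tasks, hosts, exec_time):
    # Sufferage: in each round a host is kept by the task that would suffer
    # most (largest gap between best and second-best completion time) if it
    # were denied that host.
    ready = {h: 0.0 for h in hosts}
    mapping = {}
    unmapped = set(tasks)
    while unmapped:
        claims = {}  # host -> (sufferage, task, completion time on that host)
        for t in unmapped:
            times = sorted((exec_time[t][h] + ready[h], h) for h in hosts)
            best_ct, best_h = times[0]
            suff = (times[1][0] - best_ct) if len(times) > 1 else 0.0
            # a task claims its best host unless a needier task already holds it
            if best_h not in claims or claims[best_h][0] < suff:
                claims[best_h] = (suff, t, best_ct)
        for host, (_, t, ct) in claims.items():
            mapping[t] = host
            ready[host] = ct
            unmapped.discard(t)
    return mapping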

Page 68: Scheduling for Grid Computing

Task Grouping

• In some cases an application consists of a large number of lightweight jobs. The overall processing of such applications involves a high overhead cost in terms of scheduling and transmission to or from Grid resources.

• Muthuvelu et al.: propose a dynamic task-grouping scheduling algorithm to deal with these cases.
– Once a set of fine-grained tasks is received, the scheduler groups them according to their computational requirements and the amount of processing a Grid resource can provide in a certain time period.

– All tasks in the same group are submitted to the same resource, which can finish them all in the given time.

– The overhead for scheduling and job launching is reduced, and resource utilization is increased.
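A minimal sketch of the grouping step (illustrative; task_length and capacity are hypothetical values expressed in the same work unit, e.g. million instructions per granularity period):

def group_tasks(task_length, capacity):
    # Pack fine-grained tasks into groups whose total length roughly matches
    # what one resource can process in the granularity period, so each group
    # is dispatched as a single coarse-grained job.
    groups, current, load = [], [], 0.0
    for task, length in task_length.items():
        if current and load + length > capacity:
            groups.append(current)        # close the group at the capacity limit
            current, load = [], 0.0
        current.append(task)
        load += length
    if current:
        groups.append(current)
    return groups

# e.g. group_tasks({"t1": 5, "t2": 7, "t3": 4, "t4": 6}, capacity=12)
# -> [["t1", "t2"], ["t3", "t4"]]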

Page 69: Scheduling for Grid Computing

Algorithms without Performance Estimate

• Do not use performance estimates, but adopt the idea of duplication, which is feasible in the Grid environment, where computational resources are usually abundant but mutable.

• Subramani et al.: a simple duplication scheme
– Distributes each job to the K least loaded sites
– Each of these K sites schedules the job locally
– When the job is able to start at any of the sites, that site informs the scheduler at the job-originating site, which in turn contacts the other K-1 sites to cancel the job from their respective queues.

• Silva et al.: Workqueue with Replication (WQR)

Page 70: Scheduling for Grid Computing

Dependent Task Scheduling

• Directed Acyclic Graph (DAG)
– A node represents a task.
– A directed edge denotes the precedence order between its two vertices.
– In some cases, weights can be added to nodes and edges to express computational costs and communication costs, respectively.

• Condor DAGMan, CoG, Pegasus, GridFlow, ASKALON
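For illustration, such a weighted DAG can be written down as plain Python dictionaries, with a topological order as the starting point for list scheduling (the node names and weights below are made up):

from graphlib import TopologicalSorter

# node -> computational cost, edge (u, v) -> communication cost (hypothetical values)
node_cost = {"A": 4, "B": 3, "C": 2, "D": 5}
edge_cost = {("A", "B"): 1, ("A", "C"): 2, ("B", "D"): 1, ("C", "D"): 3}

# predecessors of every node, derived from the edges
preds = {n: set() for n in node_cost}
for (u, v) in edge_cost:
    preds[v].add(u)

# a valid precedence-respecting order, the usual starting point of list heuristics
order = list(TopologicalSorter(preds).static_order())
print(order)   # e.g. ['A', 'B', 'C', 'D']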

Page 71: Scheduling for Grid Computing

DAG

Page 72: Scheduling for Grid Computing

Grid Systems Supporting Dependent Task Scheduling

• To run a workflow in a Grid:
– How the tasks in the workflow are scheduled: Grid workflow generators
– How to submit the scheduled tasks to Grid resources without violating the structure of the original workflow: Grid workflow engines

Page 73: Scheduling for Grid Computing

Taxonomy of Algorithms for Dependent Task Scheduling

• List heuristics
– Heterogeneous Earliest-Finish-Time (HEFT)
– Fast Critical Path (FCP)

• Duplication-based algorithms
– Task Duplication-based Scheduling (TDS)
– Task duplication-based scheduling Algorithm for Network of Heterogeneous systems (TANH)

• Clustering heuristics
– Dominant Sequence Clustering (DSC)
– CASS-II

Page 74: Scheduling for Grid Computing

Data Scheduling

• In high-energy physics, bioinformatics, and other disciplines, there are applications involving numerous parallel tasks that both access and generate large data sets, sometimes in the petabyte range.

• Remote data storage, access management, replication services, and data transfer protocols.

Page 75: Scheduling for Grid Computing

Park et al.'s model of cost measured in makespan

• Local Data and Local Execution

• Local Data and Remote Execution

• Remote Data and Local Execution

• Remote Data and Same Remote Execution

• Remote Data and Different Remote Execution

Page 76: Scheduling for Grid Computing

On Data Replication

• When the scheduling problem with data movement is considered, there are two situations: data replication is either allowed or not.

• In Pegasus, the CWG assumes that accessing an existing dataset is always preferable to generating a new one when it maps an abstract workflow onto a concrete one.

• Ranganathan et al.: view data sets in the Grid as a tiered system and use dynamic replication strategies to improve data access.

Page 77: Scheduling for Grid Computing

On Computation and Data Scheduling

• When the interaction of computation scheduling and data scheduling is considered, there are two different approaches:
– Decoupling computation scheduling from data scheduling
– Conducting combined scheduling

Page 78: Scheduling for Grid Computing

Non-traditional Approaches for Grid Task Scheduling

• Grid Economy
– Economic cost/profit considered
– No economic cost/profit considered

• Nature's Heuristics
– Genetic algorithms
– Simulated annealing
– Tabu search

Page 79: Scheduling for Grid Computing

Scheduling under QoS Constraints

• In a distributed, heterogeneous, non-dedicated environment, quality of service (QoS) is a big concern for many applications. The meaning of QoS varies according to the concerns of different users: it could be a requirement on CPU speed, memory size, bandwidth, software version, or deadline.

• In general, QoS is not the ultimate objective of an application, but a set of conditions needed to run the application successfully.

Page 80: Scheduling for Grid Computing

Strategies Treating Dynamic Resource Performance

• On-Time-Information from GIS

• Performance prediction based on GIS, historical records, and workload modeling
– On prediction accuracy
– Prediction based on historical records
– Prediction based on workload modeling

• Rescheduling

Page 81: Scheduling for Grid Computing

Open Issues in Grid Scheduling

• Application and enhancement of classic heterogeneous scheduling algorithms in the Grid environment

• New algorithms utilizing dynamic performance prediction

• New rescheduling algorithms adaptive to performance variation

• New algorithms under QoS constraints

• New algorithms considering combined computation and data scheduling

• New problems introduced by new models

• New algorithms utilizing the Grid resource overlay structure

Page 82: Scheduling for Grid Computing

How to Read Technical Papers

• Learn multiple ways of finding relevant literature
– Google + keywords
– Citations in papers
– Web pages of well-known experts
– Master's and PhD theses
– Papers from recent academic conferences

• Go from individual points to the whole field, from the rough to the refined, from quantity to quality

• A good memory is no match for good notes

• Be good at summarizing and proposing your own ideas
– How can I make use of this paper?
– Is it really as the authors claim?
– What would happen if ...?