grid scheduling

1

Grid Scheduling

Cécile Germain-Renaud

2

Scheduling

• Job
– A computation to run on a machine
– Possibly with network access, e.g. input/output files (coarse grain) or communication with other jobs (the DAG model)
• Schedule
– s(J) = date to begin execution of task J
– Alloc(J) = machine assigned to J

• One of the oldest Computer Science problems

• Principles of classification:

[Graham et al. Optimization and approximation in deterministic sequencing and scheduling: A survey. Ann. Discrete Math. 5, (1979), 287-326]

• Computer-aided classification of complexity results (4536 at the time of the paper) [Lageweg et al. Computer-aided complexity classification of combinatorial problems. CACM 25(11), 1982]

3

Classical scheduling in HPC

• Context: parallel computing/computers
• Application = Directed Acyclic Graph (T, E, w, c)
– T = set of sequential tasks
– E = dependence constraints
– w(t) = computational cost of task t
– c(t,t’) = communication cost (data sent from t to t’)
• Infrastructure
– p identical processors
– With or without preemption, dedicated (no sharing)
• An optimization problem with objective function
Makespan = total execution time S(T) = max over t in T of (s(t) + w(t))
• Complexity
– NP-complete for independent tasks and no communication: E = ∅, p = 2 and c = 0
– NP-complete for UET-UCT graphs (w = c = 1)
– Very old: without communication, list scheduling provides a (2 - 1/p) approximation
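A minimal sketch of list scheduling, restricted to the simplest case on this slide: independent tasks (E empty, c = 0) on p identical processors. Each task in list order goes to the processor that becomes free first. The function name and the sample durations are illustrative assumptions, not taken from the slides.

```python
import heapq

def list_schedule(durations, p):
    """Greedy list scheduling of independent tasks on p identical processors.

    durations: task costs w(t), taken in list order.
    Returns (start, alloc, makespan), i.e. s(t), Alloc(t) and max(s(t) + w(t)).
    """
    # Heap of (time at which the processor becomes free, processor index).
    free_at = [(0, proc) for proc in range(p)]
    heapq.heapify(free_at)
    start, alloc = [], []
    for w in durations:
        t, proc = heapq.heappop(free_at)      # earliest available processor
        start.append(t)
        alloc.append(proc)
        heapq.heappush(free_at, (t + w, proc))
    makespan = max((s + w for s, w in zip(start, durations)), default=0)
    return start, alloc, makespan

# Illustrative instance: the optimum is 6 ({3, 3} and {2, 2, 2}).
start, alloc, makespan = list_schedule([3, 3, 2, 2, 2], p=2)
print(makespan)  # 7, within the (2 - 1/p) * 6 = 9 guarantee
```

On this instance the greedy schedule is not optimal, but it stays within the (2 - 1/p) bound.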


4

Scheduling in Institutional Grids

• Institutional: federation of resources
– Accounted for: fair share on the medium to long time scale is a primary constraint
– Partially autonomous local policies must be allowed
• Grid
– Permanent regime: on-line decisions
– Large scale: strongly distributed
• Information system
• Scheduling services
• Relevant contexts
– Autonomous multi-agent systems
– Auction algorithms
– Service Level Agreement (SLA) technology

5

EGEE gLite Scheduling

[Figure: gLite scheduling architecture, showing UIs, Brokers, and a Site (node) with its CE, local scheduler, and processors (Proc)]

6


[Figure: same architecture (UIs, Brokers, Site (node) with CE, local scheduler, Proc) with the BDII added. Label: Publish]

7


[Figure: Site (node) with CE, local scheduler, and processors (Proc); Broker; BDII; UIs. Labels: Publish, Query, Rank]

The information published is
Static: e.g. which type of VO is accepted
Dynamic: expected traversal time

8


[Figure: Site (node) with CE, local scheduler, and processors (Proc); Broker; BDII; UIs. Labels: Publish, Query, Rank]

Rank: may be any user-defined function, e.g. avoid "bad" machines
Default is first locality, second expected traversal time
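The ranking step can be sketched as follows. This is an illustration only: the record type, field names, VO names, and the blacklist parameter are hypothetical and do not reproduce the gLite interface. The default key orders CEs by locality first and expected traversal time second; any other function can be passed instead.

```python
from dataclasses import dataclass

@dataclass
class CEInfo:
    # Hypothetical record of what a CE publishes; field names are illustrative.
    name: str
    accepted_vos: frozenset         # static information
    holds_input_data: bool          # locality with respect to the job's input
    expected_traversal_time: float  # dynamic information, in seconds

def default_rank(ce):
    # Smaller key is better: local CEs first, then shortest expected traversal time.
    return (not ce.holds_input_data, ce.expected_traversal_time)

def select_ce(ces, vo, rank=default_rank, blacklist=()):
    # Keep CEs that accept the VO and are not on the user's list of "bad" machines.
    candidates = [ce for ce in ces
                  if vo in ce.accepted_vos and ce.name not in blacklist]
    return min(candidates, key=rank) if candidates else None

ces = [
    CEInfo("ce-a", frozenset({"vo1"}), False, 120.0),
    CEInfo("ce-b", frozenset({"vo1", "vo2"}), True, 900.0),
    CEInfo("ce-c", frozenset({"vo2"}), True, 30.0),
]
print(select_ce(ces, "vo1").name)                      # ce-b (local, despite the longer wait)
print(select_ce(ces, "vo1", blacklist={"ce-b"}).name)  # ce-a
```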

9


[Figure: Site (node) with CE, local scheduler, and processors (Proc); Broker; BDII; UIs. Labels: Publish, Query, Update; BDII broker cache]

10

Not only academic

[Figure: plot with axis labels "Execution time (s)" and "Overhead Ratio"]

• Long waiting times

• When EGEE was not so heavily loaded

11

Batch scheduling

• Very complex policies
• Maximise throughput under constraints
– Weighted fair share: VOs, types of jobs
– Priorities
– Hardware requirements
– Advance reservations

• An indication of job duration is given by the type of queue: infinite, long, medium, short, and exotic ones

[B. Bode et al. The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters]
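One way to picture a weighted fair-share constraint is the deliberately simplified sketch below. It is not the formula used by any actual batch scheduler: the function name, the VO names, and the "target share minus observed share" deficit are assumptions made for illustration.

```python
def fair_share_order(targets, usage, queued):
    """Order queued jobs by how far each VO is below its target share.

    targets: VO -> target fraction of the resource (weights summing to 1)
    usage:   VO -> CPU time consumed over the accounting window
    queued:  list of (job_id, vo) in arrival order
    """
    total = sum(usage.values()) or 1.0
    deficit = {vo: targets[vo] - usage.get(vo, 0.0) / total for vo in targets}
    # sorted() is stable, so jobs with equal deficit keep their arrival order.
    return sorted(queued, key=lambda job: -deficit[job[1]])

targets = {"vo1": 0.5, "vo2": 0.3, "vo3": 0.2}
usage = {"vo1": 700.0, "vo2": 250.0, "vo3": 50.0}
queued = [("j1", "vo1"), ("j2", "vo2"), ("j3", "vo3"), ("j4", "vo1")]
order = fair_share_order(targets, usage, queued)
print([job_id for job_id, _ in order])  # ['j3', 'j2', 'j1', 'j4']
```

Here vo3 has used the least relative to its target, so its job moves to the front, while the over-served vo1 waits.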

12

Classical vs Grid

• (Relatively) easy:
– Throughput instead of makespan, plus a master-slave graph instead of a DAG, makes it possible, for instance, to define in polynomial time cyclic schedules that are asymptotically optimal, but not local [Y. Robert] [A. Rosenberg]
• Moderately difficult: information about
– Applications
– Infrastructures
• The same program on different data may run at very different speeds
• The network performance is dynamic
• Really difficult
– Queues managed by local policies
– On-line decision

13

Information and Scheduling (I)

• Considerable work has been done in predicting CPU load in shared environments: desktops, clusters, desktop grids [P.A. Dinda, R. Wolski, J. Schopf]
– The basic technique is linear time-series analysis
– Self-similarity and epochal behavior
– Usual goal is the prediction of the next value
– Applied to soft real-time scheduling on shared clusters
– Practical application in NWS

$$z_t = \mu + \frac{\theta(B)}{\phi(B)\,(1 - B)^d}\, a_t$$
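As an illustration of next-value prediction with linear time-series analysis, the sketch below fits the simplest member of this model family, an AR(1) model, by least squares and forecasts one step ahead. It is an assumption-laden toy in pure Python, not the predictor used in the cited work or in NWS. The function name and the sample load trace are made up.

```python
def ar1_predict(series):
    """One-step-ahead forecast with an AR(1) model fitted by least squares.

    Model: z_t - mu = phi * (z_{t-1} - mu) + a_t, with a_t white noise.
    Returns (forecast of the next value, estimated phi).
    """
    n = len(series)
    mu = sum(series) / n
    x = [v - mu for v in series]                    # centred series
    num = sum(x[t] * x[t - 1] for t in range(1, n))
    den = sum(x[t - 1] ** 2 for t in range(1, n))
    phi = num / den if den else 0.0                 # constant series: predict the mean
    return mu + phi * x[-1], phi

# Made-up, perfectly alternating load trace.
forecast, phi = ar1_predict([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
print(forecast, phi)  # 1.0 -1.0
```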

14

Information and scheduling (III)

• Less work on predicting the behavior of dedicated systems

• Papers are on parallel systems, mostly based on time-series techniques, but at least one is based on a genetic algorithm [Downey, Foster, Wolski]
• The traces are much more difficult to access
• No time slice, irregular time series: the records are event-driven
• Which analysis?
– Average waiting time: clear but not very useful for prediction
– Fitting a distribution: not convincing for parallel systems
– Predicting an upper bound with a confidence interval: what is the metric of success?
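The last option can be made concrete with a small sketch. It assumes the bound is simply the empirical q-quantile of past waiting times, and that success is measured as the fraction of later jobs whose wait stayed within that bound. Both choices, and the sample values, are illustrative assumptions rather than the method of the cited papers.

```python
import math

def upper_bound(waits, q=0.95):
    """Empirical q-quantile of past waiting times, used as a predicted upper bound."""
    ordered = sorted(waits)
    k = max(0, math.ceil(q * len(ordered)) - 1)   # index of the q-quantile
    return ordered[k]

def coverage(history, future, q=0.95):
    """Fraction of later jobs whose waiting time stayed within the bound."""
    bound = upper_bound(history, q)
    return sum(w <= bound for w in future) / len(future)

# Made-up waiting times in seconds.
history = [5, 10, 20, 30, 40, 60, 120, 600]
future = [15, 50, 70, 45]
print(upper_bound(history, q=0.75))       # 60
print(coverage(history, future, q=0.75))  # 0.75
```

A bound is well calibrated when the observed coverage is close to q. A coverage far above q means the bound is too loose to be useful.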

15

Information and grid

• We cannot directly log the entire state of the system
– Access rights
– Size
• Currently available data
– The lifecycle of jobs going through certain brokers
– The job ranking at the same brokers
– The detailed behavior of the queues on certain sites
– Certain = LAL + possibly other mainstream
• Easy to get
– Summary data about the lifecycle of all jobs
– From which it could be possible to reconstruct the detailed state and dynamics of the CE
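A sketch of how such a reconstruction might work, assuming each summary record gives the submit, start, and end times of a job on one CE. The record layout and function name are hypothetical. Replaying these events in time order yields the number of waiting and running jobs at each instant.

```python
from collections import defaultdict

def reconstruct_state(jobs):
    """jobs: iterable of (submit, start, end) timestamps for one CE.

    Returns a list of (time, waiting, running), one entry per distinct timestamp.
    """
    delta = defaultdict(lambda: [0, 0])   # time -> [change in waiting, change in running]
    for submit, start, end in jobs:
        delta[submit][0] += 1             # job enters the queue
        delta[start][0] -= 1              # job leaves the queue...
        delta[start][1] += 1              # ...and starts running
        delta[end][1] -= 1                # job finishes
    waiting = running = 0
    timeline = []
    for time in sorted(delta):
        waiting += delta[time][0]
        running += delta[time][1]
        timeline.append((time, waiting, running))
    return timeline

# Made-up records: the first job starts immediately, the others queue behind it.
timeline = reconstruct_state([(0, 0, 10), (1, 10, 15), (2, 12, 20)])
print(timeline[2])  # (2, 2, 1): two jobs waiting, one running at t = 2
```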

16

What should we learn?

• Learning besides time series makes sense in a grid: massive use of community programs instead of (?) sparse runs of a very long and complex digital experiment
• Information as sketched before
– Beware: it will not be a steady-state system
• New users, new machines, new software are the expected regime for some years from now
• A community-based resource will tend to display correlated activity
– Is there an invariant social graph? Is it a feature?
• System algorithms, e.g. a site scheduler or the broker
– Validation?
• Scheduling algorithms
– Validation?
