Analyzing and Minimizing the Impact of Opportunity Cost in QoS-aware Job Scheduling M. Islam, P. Balaji, G. Sabin and P. Sadayappan Computer Science and Engineering, Ohio State University Mathematics and Computer Science, Argonne National Laboratory RNet Technologies

Post on 24-Feb-2016


Page 1: Analyzing and Minimizing the Impact of Opportunity Cost in QoS-aware Job Scheduling

M. Islam, P. Balaji, G. Sabin and P. Sadayappan

Computer Science and Engineering, Ohio State University

Mathematics and Computer Science, Argonne National Laboratory

RNet Technologies

Page 2: Job Schedulers Today

• Publicly Usable Supercomputer Centers
  – Becoming increasingly common (OSC, SDSC, etc.)
  – Jobs submitted with resource requirements
    • CPUs, Memory, Estimated Runtime
• Scheduler maps the requirements of the jobs to available resources
  – If resources are available, job is scheduled immediately
  – Else, queued and scheduled to execute at a later time
  – Several job schedulers exist today: PBS, Maui, Silver
• Independent Parallel Job Scheduling Model
  – Dynamically arriving Independent Parallel Jobs
  – Popular model in most supercomputers

Page 3: Simple Job Scheduler Model

[Figure: a user submits jobs to the job scheduler, which keeps an execution queue, a reservation queue and the processors’ status; the processor space contains six processors, P1 to P6.]

Example jobs shown in the figure:
• Job J1: 2 processors, 1 hour
• Job J2: 5 processors, 1 hour
• Job J3: 4 processors, 1 hour
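To make the scheduler model concrete, here is a minimal Python sketch that places jobs J1, J2 and J3 from this slide on the six processors. The slide does not name a queueing policy, so strict first-come-first-served order without backfilling is an assumption, and the names `Job` and `fcfs_schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int    # processors requested
    hours: float  # estimated runtime in hours


def fcfs_schedule(jobs, total_procs):
    """Return {job name: start time} for strict first-come-first-served order.

    Assumption (not from the slides): a job never starts before the job
    ahead of it in the queue, i.e. there is no backfilling.
    """
    running = []  # (end_time, procs) for jobs already placed
    starts = {}
    now = 0.0
    for job in jobs:
        if job.procs > total_procs:
            raise ValueError(f"{job.name} requests more processors than exist")
        while True:
            # Forget jobs that have finished by the current time.
            running = [(end, p) for end, p in running if end > now]
            free = total_procs - sum(p for _, p in running)
            if free >= job.procs:
                break
            # Not enough room: wait until the next running job finishes.
            now = min(end for end, _ in running)
        starts[job.name] = now
        running.append((now + job.hours, job.procs))
    return starts


starts = fcfs_schedule(
    [Job("J1", 2, 1), Job("J2", 5, 1), Job("J3", 4, 1)], total_procs=6
)
```

Under this assumed policy J1 runs immediately, while J2 and J3 are queued because the free processors are insufficient when they reach the head of the queue.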

Page 4: Two Dimensional Scheduling Grid

[Figure: two-dimensional scheduling grid with axes Time and Processors. Running jobs J1, J2 and J3 are drawn at the current time; jobs J4, J5 and J6 wait in the job queue.]

Page 5: Previous Research in Job Scheduling

• Significant prior research on best-effort scheduling
• Optimizations proposed for different metrics
  – Utilization (U): what fraction of the resources is actually utilized
    • U = Resource Used / Resource Provided
  – Response Time (RT): time from submission to completion
    • RT = Job’s completion time - Job’s arrival time
  – Slowdown (SD): how much slower the system is compared to a dedicated system
    • SD = Job’s Response Time / Job’s Runtime
  – Prioritization: static (user or group based) and dynamic (how long the job was in the queue)
    • NERSC cluster provides static prioritization based on job cost
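The three metric definitions on this slide translate directly into code. The sketch below is only an illustration; the function names and the sample numbers are hypothetical and do not come from the presentation.

```python
def utilization(resource_used, resource_provided):
    """U = Resource Used / Resource Provided (e.g. in processor-hours)."""
    return resource_used / resource_provided


def response_time(arrival_time, completion_time):
    """RT = job's completion time - job's arrival time."""
    return completion_time - arrival_time


def slowdown(arrival_time, completion_time, runtime):
    """SD = job's response time / job's runtime."""
    return response_time(arrival_time, completion_time) / runtime


# Hypothetical job: arrives at t=0 h, waits 1 h in the queue, runs for 1 h.
rt = response_time(arrival_time=0.0, completion_time=2.0)
sd = slowdown(arrival_time=0.0, completion_time=2.0, runtime=1.0)

# Hypothetical system: 11 processor-hours used out of 18 provided.
u = utilization(resource_used=11.0, resource_provided=18.0)
```

A slowdown of 1 means the job ran as if it had a dedicated system; larger values reflect time spent waiting in the queue.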

Page 6: QoS in Job Scheduling

• Users can request guarantees in turnaround time
  – E.g., submit a job before leaving work at 5pm and request a deadline at 8am the next morning
• Two Components for QoS in Job Scheduling
  – Job Scheduling Component [islam03:qops]
    • Admission Control: Can we meet the specified deadline?
    • Once admitted, a job cannot miss the specified deadline
  – Revenue Management
    • Appropriate charging model
    • Urgent jobs cost more than non-urgent jobs
    • Need to prioritize jobs such that the incoming revenue is maximized

[islam03:qops] M. Islam, P. Balaji, P. Sadayappan and D. K. Panda, “QoPS: A QoS based scheme for Parallel Job Scheduling”. Published in JSSPP ’03 and LNCS ’04.

Page 7: Opportunity Cost in Job Scheduling

[Figure: scheduling grid (axes: Time, Processors) showing running jobs J1, J2 and J3 at the current time, job J4 (10$) with deadline D4 and job J5 (500$) with deadline D5.]

• By scheduling J4, we lost the future opportunity to schedule the more expensive job J5
• J4 has an opportunity cost of at least 500$

Page 8: Problem Statement

• When the user submits a job, she pays an explicit cost
• However, the system also pays an implicit opportunity cost
• Accepting a job is beneficial if its explicit cost is greater than its opportunity cost
• How do we determine the opportunity cost?
  – It depends on future jobs, so there is no way to know it in advance
• How do we design a predictive algorithm to estimate the opportunity cost of a job?

Page 9: Presentation Layout

• Introduction and Motivation

• Background on QoPS and QoS Cost Models

• Minimizing Opportunity Cost with Value-aware QoPS

• Dynamic “Self-learning” Value-aware QoPS

• Performance Results

• Conclusions

Page 10: QoPS: QoS for Parallel Job Scheduling

• Advanced Reservation (before QoPS)
  – Before QoPS, the only way to guarantee a turnaround time
    • Execution time window statically decided upfront
  – Resources underutilized due to fragmentation
  – If resources are available early, the job can’t be rescheduled
• Primary Goals of QoPS:
  – Provide admission control
    • When a new job arrives:
      – Reorder existing jobs to find feasible schedules
      – Select the best feasible schedule
  – Ensure deadline guarantees for the accepted jobs
    • A later arriving job cannot force an existing job to miss its deadline!
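The admission-control idea (reorder, check feasibility, never break an existing deadline) can be illustrated with a small Python sketch. This is not the QoPS algorithm itself: it simply tries the new job at every queue position and list-schedules each candidate order. The names `Job`, `completion_times`, `feasible` and `admit` are hypothetical, and all jobs are assumed to be queued at time 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    procs: int       # processors requested
    hours: float     # estimated runtime
    deadline: float  # absolute time by which the job must finish


def completion_times(order, total_procs):
    """List-schedule jobs in the given order; return {name: finish time}."""
    running, now, finish = [], 0.0, {}
    for job in order:
        while True:
            running = [(end, p) for end, p in running if end > now]
            if total_procs - sum(p for _, p in running) >= job.procs:
                break
            now = min(end for end, _ in running)  # wait for a job to end
        finish[job.name] = now + job.hours
        running.append((now + job.hours, job.procs))
    return finish


def feasible(order, total_procs):
    """True if every job in the order finishes by its deadline."""
    finish = completion_times(order, total_procs)
    return all(finish[job.name] <= job.deadline for job in order)


def admit(queue, new_job, total_procs):
    """Return a feasible order that includes new_job, or None to reject it."""
    if new_job.procs > total_procs:
        return None
    for pos in range(len(queue), -1, -1):  # try the back of the queue first
        candidate = queue[:pos] + [new_job] + queue[pos:]
        if feasible(candidate, total_procs):
            return candidate
    return None


queue = [Job("J1", 2, 1, deadline=1), Job("J2", 5, 1, deadline=3)]
accepted = admit(queue, Job("JN", 4, 1, deadline=1), total_procs=6)
rejected = admit(queue, Job("JX", 6, 1, deadline=1), total_procs=6)
```

In this example JN is admitted by moving it ahead of J2, which still meets its deadline; JX is rejected because every placement would make it or an already accepted job finish late.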

Page 11: Supercomputer Cost Model

• Most supercomputer centers today do not provide QoS
  – Jobs are scheduled in a best-effort manner
  – Thus, no special cost models for QoS either
• Some supercomputers provide prioritization (e.g., NERSC)
  – Different queues of jobs exist
  – More expensive queues get higher priority
• For QoS-driven supercomputers, a new model is required
  – Provider-centric: supercomputer center determines the charge
  – User-centric: user offers the price / bid

Page 12: Market-based User-centric Cost Model

• User offers a price to the system
  – Market-based bidding system
  – Proposed by Culler and Chase
• Price offered reduces with time (decay factor)
• Offered price touches zero at the job deadline time

[Figure: revenue plotted against time, with the maximum revenue and the deadline marked.]
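A price that decays with time and reaches zero at the deadline can be sketched as follows. The slide only states that the price reduces with time; the linear shape, the decay starting at submission time and the function name `offered_price` are assumptions made for illustration.

```python
def offered_price(max_revenue, submit_time, deadline, completion_time):
    """Price paid for a job that completes at completion_time.

    Assumption (not from the slides): the price falls linearly from
    max_revenue at submission to zero at the deadline.
    """
    if deadline <= submit_time:
        raise ValueError("deadline must be later than submission time")
    remaining = max(0.0, deadline - completion_time)
    fraction = min(1.0, remaining / (deadline - submit_time))
    return max_revenue * fraction


# Hypothetical job worth at most 100$, submitted at t=0 h, deadline t=10 h.
full = offered_price(100.0, 0.0, 10.0, completion_time=0.0)
half = offered_price(100.0, 0.0, 10.0, completion_time=5.0)
late = offered_price(100.0, 0.0, 10.0, completion_time=12.0)
```

With such a curve, a higher maximum revenue for the same deadline gives both a taller and a steeper line, so delaying an urgent job costs the system more per hour.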

Page 14: Value-aware QoPS (VQoPS)

• Job acceptance based on two criteria:
  – The deadline should be achievable (evaluated using QoPS)
  – The job should provide enough revenue to offset a statically assumed opportunity cost
    • Product of a fixed opportunity cost factor (OC-Factor) and the size of the job (i.e., number of processor-hours requested)
    • Large jobs (more nodes or long running) have a higher opportunity cost since they can potentially impact more later-arriving jobs
• The OC-Factor has to be tuned by the system administrator based on the expected workload!
  – Complicated to evaluate
  – Difficult to adapt if workload changes
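The two acceptance criteria can be written as a short Python sketch. The deadline check is passed in as a boolean because it is delegated to QoPS; the function names and the sample job sizes are hypothetical, and treating "enough revenue" as "revenue at least equal to the assumed opportunity cost" is an assumption.

```python
def static_opportunity_cost(oc_factor, procs, hours):
    """Assumed opportunity cost = OC-Factor x processor-hours requested."""
    return oc_factor * procs * hours


def vqops_accept(revenue, procs, hours, oc_factor, deadline_feasible):
    """Accept only if the deadline is achievable and the revenue offsets
    the statically assumed opportunity cost."""
    cost = static_opportunity_cost(oc_factor, procs, hours)
    return deadline_feasible and revenue >= cost


# Hypothetical jobs of 4 processors x 25 hours = 100 processor-hours each,
# evaluated with an OC-Factor of 0.2 (assumed opportunity cost of about 20$).
cheap = vqops_accept(10.0, 4, 25.0, oc_factor=0.2, deadline_feasible=True)
valuable = vqops_accept(500.0, 4, 25.0, oc_factor=0.2, deadline_feasible=True)
```

Here the 10$ job is turned away even though its deadline is achievable, which keeps the processors free for a later, more valuable job.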

Page 15: VQoPS: An Example Scenario

[Figure: scheduling grid (axes: Time, Processors) showing running jobs J1, J2 and J3 at the current time, job J4 (10$) with deadline D4, annotated “Less than static opportunity cost (C)”, and job J5 (500$) with deadline D5.]

• By not scheduling J4, we retained the future opportunity to schedule the more expensive job J5
• Choosing the right OC-Factor is important for the scheme to be effective

Page 17: VQoPS performance for different traces

Relative       Urgent     Offered    OC-Factor
Urgency Cost   Jobs (%)   Load       0.00    0.05    0.1     0.2     0.4
10X            80%        Original   21%     26%     37%     37%     39%
5X             80%        Original   20%     25%     34%     35%     30%
2X             80%        Original   19%     26%     27%     -47%    -100%
10X            80%        Original   21%     26%     37%     37%     39%
10X            50%        Original   23%     34%     46%     45%     45%
10X            20%        Original   26%     38%     22%     22%     22%
10X            80%        Original   21%     26%     37%     37%     39%
10X            80%        High       63%     90%     135%    144%    160%

• No single static OC-Factor is best for all cases.
• Best OC-Factor is dependent on trace characteristics.
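The claim that no single OC-Factor wins everywhere can be checked mechanically from the table. The sketch below copies the distinct rows of the table and reports, for each trace, the OC-Factor with the highest percentage. Treating a higher percentage as better is an assumption about how the table is read, and the names `ROWS` and `best_oc_factor` are hypothetical.

```python
OC_FACTORS = [0.00, 0.05, 0.1, 0.2, 0.4]

# (relative urgency cost, urgent jobs, offered load) -> percentages per OC-Factor
ROWS = {
    ("10X", "80%", "Original"): [21, 26, 37, 37, 39],
    ("5X", "80%", "Original"): [20, 25, 34, 35, 30],
    ("2X", "80%", "Original"): [19, 26, 27, -47, -100],
    ("10X", "50%", "Original"): [23, 34, 46, 45, 45],
    ("10X", "20%", "Original"): [26, 38, 22, 22, 22],
    ("10X", "80%", "High"): [63, 90, 135, 144, 160],
}


def best_oc_factor(values):
    """Return the OC-Factor whose column holds the largest percentage."""
    index = max(range(len(values)), key=values.__getitem__)
    return OC_FACTORS[index]


best = {trace: best_oc_factor(values) for trace, values in ROWS.items()}
distinct_winners = sorted(set(best.values()))
```

Four different OC-Factors come out on top across the six distinct traces, which is the motivation for choosing the factor dynamically.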

Page 19: Dynamic “Self-learning” Value-aware QoPS

• Estimate OC-Factor dynamically for best revenue gain
• OC-Factor depends on
  – System load
  – Relative frequency of urgent jobs
  – Relative price of urgent jobs
• DVQoPS uses a history-based adaptive technique that accounts for all of these factors
  – Perform a what-if simulation by rolling back and find the best OC-Factor

Page 20: What-if Simulations in DVQoPS

[Figure: timeline of rollback intervals. In each interval, what-if simulations are run for the candidate OC-Factors O1, O2, O3 through ON.]

• First interval: O3 gave us the best revenue, so pick O3 (OC Factor = O3)
• Next interval: O2 gave us the best revenue, so pick O2 (OC Factor = O2)
• We dynamically pick the OC-Factor that gave the best revenue in the previous roll-back interval
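The selection step can be sketched in Python as a replay of the jobs seen in the previous roll-back interval under each candidate OC-Factor. The replay below is a deliberately crude stand-in for the real what-if simulation: a fixed processor-hour budget replaces the QoPS deadline check, and the names `simulate_revenue` and `pick_oc_factor`, the candidate list and the job history are all hypothetical.

```python
def simulate_revenue(history, oc_factor, capacity):
    """Toy what-if replay of (price, processor_hours) jobs in arrival order.

    Assumptions: a job is skipped if its price is below
    oc_factor * processor_hours, or if it no longer fits in the remaining
    processor-hour budget (a stand-in for the deadline feasibility check).
    """
    used = 0.0
    revenue = 0.0
    for price, proc_hours in history:
        if price < oc_factor * proc_hours:
            continue  # does not offset the assumed opportunity cost
        if used + proc_hours > capacity:
            continue  # no room left in this interval
        used += proc_hours
        revenue += price
    return revenue


def pick_oc_factor(history, candidates, capacity):
    """Return the candidate that would have earned the most revenue."""
    return max(candidates, key=lambda oc: simulate_revenue(history, oc, capacity))


CANDIDATES = [0.0, 0.05, 0.1, 0.2, 0.4]
# Previous interval: a 10$ job arrived first, then a 500$ job of the same size.
history = [(10.0, 80.0), (500.0, 80.0)]
best = pick_oc_factor(history, CANDIDATES, capacity=100.0)
```

The small factors admit the 10$ job and then have no room for the 500$ job, so the replay favors the first factor large enough to turn the cheap job away.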

Page 21: Impact of Rollback Window Size

• Balancing Sensitivity and Stability
  – Sensitivity: Too long a rollback window loses sensitivity to small changes in the workload
  – Stability: Too short a rollback window loses stability and causes the results to be noisy
• Need to calculate rollback window dynamically

Rollback      Average Instability   Load Variance   Revenue
Window Size   in OC-Factor          Sensitivity
4             6.18                  2.89            508341077
32            2.99                  0.34            692266945
48            1.36                  0.24            715606095
128           1.13                  0.04            701476009

Page 23: Simulation Setup

• Two categories of jobs
  – Urgent Jobs
  – Normal Jobs
• Job Mixes (Urgent, Normal):
  – (80%, 20%), (50%, 50%), (20%, 80%)
• Urgency factor:
  – Urgent Job Revenue = URG_FACT x Normal Job Revenue
  – URG_FACT values used: 10, 5, 2
  – URG_FACT refers to the height and steepness of the cost model curve

Page 24: Impact of Job Mix (% of Urgent Jobs)

[Figure, two panels: “Revenue Improvement (normal load)” (vertical axis: Percentage Improvement, 0% to 60%) and “Revenue Improvement (high load)” (vertical axis: Percentage Improvement, 0% to 250%). Horizontal axis in both panels: % Urgent Jobs (80%, 50%, 20%). Series: VQoPS-0.05, VQoPS-0.1, VQoPS-0.2, VQoPS-0.4, DVQoPS.]

• DVQoPS performs within 2-3% of the best VQoPS implementation

Page 25: Service Differentiation and Job Urgency

[Figure, two panels. “Service Differentiation”: Accepted Load (0 to 1.2) for Urgent, Normal and Overall jobs under QoPS, VQoPS-0.05, VQoPS-0.1 and DVQoPS. “Job Urgency”: Revenue Improvement (-120% to 60%) against Job Urgency Factor (10X, 5X, 2X) for VQoPS-0.05, VQoPS-0.1, VQoPS-0.4 and DVQoPS.]

• DVQoPS provides an appropriate amount of service differentiation depending on the cost difference
• As job urgency increases, higher VQoPS OC-Factor values perform better; DVQoPS automatically adjusts itself

Page 26: Impact of Inaccurate User Estimates

[Figure: Revenue Improvement (-10% to 20%) against Percentage of Urgent Jobs (80%, 50%, 20%) for VQoPS-0.05, VQoPS-0.1, VQoPS-0.2 and DVQoPS.]

• Overall improvement in revenue drops considerably
  – Inaccurate estimates result in a lot of wastage due to strict provisioning
• DVQoPS still performs within 2% of the best VQoPS implementation
• 15% better than QoPS

Page 28: Concluding Remarks and Future Work

• QoS in Scheduling is a new concept with growing interest
  – Schemes such as QoPS (our previous work) that provide deadlines exist, but they do not deal with system revenue
• In this paper, we analyzed the behavior of systems when a cost model is introduced
  – System dynamism adds a new parameter, “Opportunity Cost”, which makes the issue unpredictable
  – We presented two schemes, VQoPS and DVQoPS, which analyze opportunity cost and minimize its impact
  – Simulations show up to 200% better performance in some cases
• Future Work: Integrating QoS and prioritization and incorporating the code into standard schedulers

Page 30: Backup slides

Page 31: QoPS: An Example Scenario

[Figure: animation frames of the job queue J1 to J6 with a new job JN being tried at different positions in the schedule. Labels shown: MAX_ALLOWED_VIOLATION = 2, CURRENT_VIOLATION = 0, CURRENT_VIOLATION = 1.]

Page 32: Rollback Interval

• Effective rollback interval is re-estimated every MAX_ROLLBACK_INTERVAL (e.g., 128 hr)
• MaxRevenue = Revenue(currentSchedule)
• For each testInterval in {1 hr, 4 hr, 16 hr, 64 hr, 128 hr}
  – Run what-if simulation by rolling back testInterval
  – Revenue = calculated revenue of the resulting schedule
  – If Revenue > MaxRevenue
    • MaxRevenue = Revenue
    • Effective Rollback Interval = testInterval
• End for
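The loop on this slide maps almost line for line onto Python. In the sketch below, `what_if_revenue` is a hypothetical callback standing in for the what-if simulation, and keeping the current interval when no test interval beats the current schedule is an assumption, since the slide does not say what happens in that case.

```python
TEST_INTERVALS_HR = (1, 4, 16, 64, 128)


def effective_rollback_interval(current_revenue, what_if_revenue, current_interval):
    """Pick the test interval whose what-if schedule earns the most revenue.

    what_if_revenue(test_interval) is a hypothetical callback that rolls
    back by test_interval hours, re-simulates, and returns the revenue.
    """
    max_revenue = current_revenue
    effective = current_interval  # assumption: unchanged if nothing is better
    for test_interval in TEST_INTERVALS_HR:
        revenue = what_if_revenue(test_interval)
        if revenue > max_revenue:
            max_revenue = revenue
            effective = test_interval
    return effective


# Hypothetical what-if results, keyed by test interval in hours.
simulated = {1: 90.0, 4: 120.0, 16: 150.0, 64: 150.0, 128: 110.0}
chosen = effective_rollback_interval(100.0, simulated.__getitem__, current_interval=32)
```

Because the comparison is strict, ties go to the shorter interval that was tested first.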