Analyzing and Minimizing the Impact of Opportunity Cost in QoS-aware Job Scheduling
M. Islam, P. Balaji, G. Sabin and P. Sadayappan
Computer Science and Engineering, Ohio State University
Mathematics and Computer Science, Argonne National Laboratory
RNet Technologies
Job Schedulers Today
• Publicly Usable Supercomputer Centers
  – Becoming increasingly common (OSC, SDSC, etc.)
  – Jobs submitted with resource requirements
    • CPUs, Memory, Estimated Runtime
• Scheduler maps the requirements of the jobs to available resources
  – If resources are available, the job is scheduled immediately
  – Else, it is queued and scheduled to execute at a later time
  – Several job schedulers exist today: PBS, Maui, Silver
• Independent Parallel Job Scheduling Model
  – Dynamically arriving Independent Parallel Jobs
  – Popular model in most supercomputers
Simple Job Scheduler Model
[Figure: Simple Job Scheduler Model. Components shown: User, Job Scheduler, Execution Queue, Reservation Queue, Processors' Status, and a Processor Space with processors P1 to P6.]
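Two Dimensional Scheduling Grid
• Job J1: 2 processors, 1 hour
• Job J2: 5 processors, 1 hour
• Job J3: 4 processors, 1 hour
[Figure: Two Dimensional Scheduling Grid. Axes: Time and Processors. Shown: jobs J1, J2 and J3 placed on the grid, the labels Current Time and Running Jobs, and a Job Queue holding J4, J5 and J6.]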
Previous Research in Job Scheduling
• Significant prior research on best-effort scheduling
• Optimizations proposed for different metrics
  – Utilization (U): what fraction of the resources is actually utilized
    • U = Resource Used / Resource Provided
  – Response Time (RT): time from submission to completion
    • RT = Job's completion time - Job's arrival time
  – Slowdown (SD): how much slower the system is compared to a dedicated system
    • SD = Job's Response Time / Job's Runtime
  – Prioritization: Static (user or group based) and Dynamic (how long the job was in the queue)
    • NERSC cluster provides static prioritization based on job cost
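The three metric formulas on this slide can be sketched in a few lines of Python. The `FinishedJob` record, its field names, and the use of hours as the time unit are assumptions made for illustration; they are not part of the original talk.

```python
from dataclasses import dataclass


@dataclass
class FinishedJob:
    """Hypothetical record of a completed job (all times in hours)."""
    arrival: float      # submission time
    start: float        # time the job began executing
    runtime: float      # actual runtime
    processors: int     # processors held while running

    @property
    def completion(self) -> float:
        return self.start + self.runtime


def response_time(job: FinishedJob) -> float:
    # RT = job's completion time - job's arrival time
    return job.completion - job.arrival


def slowdown(job: FinishedJob) -> float:
    # SD = job's response time / job's runtime
    return response_time(job) / job.runtime


def utilization(jobs, total_processors: int, horizon_hours: float) -> float:
    # U = resource used / resource provided, both in processor-hours
    used = sum(j.processors * j.runtime for j in jobs)
    provided = total_processors * horizon_hours
    return used / provided


job = FinishedJob(arrival=0.0, start=1.0, runtime=2.0, processors=4)
print(response_time(job), slowdown(job), utilization([job], 6, 4.0))
```

A job that waits one hour and then runs for two hours has a response time of 3 hours and a slowdown of 1.5.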
QoS in Job Scheduling
• Users can request guarantees on turnaround time
  – E.g., submit a job before leaving work at 5pm and request a deadline of 8am the next morning
• Two Components for QoS in Job Scheduling
  – Job Scheduling Component [islam03:qops]
    • Admission Control: Can we meet the specified deadline?
    • Once admitted, a job cannot miss the specified deadline
  – Revenue Management
    • Appropriate charging model
    • Urgent jobs cost more than non-urgent jobs
    • Need to prioritize jobs such that the incoming revenue is maximized
[islam03:qops] "QoPS: A QoS based scheme for Parallel Job Scheduling", M. Islam, P. Balaji, P. Sadayappan and D. K. Panda. Published in JSSPP '03 and LNCS '04.
Opportunity Cost in Job Scheduling
[Figure: scheduling grid with axes Time and Processors, showing jobs J1, J2 and J3, the labels Current Time and Running Jobs, job J4 ($10) with deadline D4, and job J5 ($500) with deadline D5.]
By scheduling J4, we lost the future opportunity to schedule the more expensive job J5
J4 has an opportunity cost of at least $500
Problem Statement
• When the user submits a job, she pays an explicit cost
• However, the system also pays an implicit opportunity cost
• Accepting a job is beneficial if its explicit cost is greater than its opportunity cost
• How do we determine the opportunity cost?
  – It depends on future jobs, so there is no way to know it in advance
• How do we design a predictive algorithm to estimate the opportunity cost of a job?
Presentation Layout
• Introduction and Motivation
• Background on QoPS and QoS Cost Models
• Minimizing Opportunity Cost with Value-aware QoPS
• Dynamic “Self-learning” Value-aware QoPS
• Performance Results
• Conclusions
QoPS: QoS for Parallel Job Scheduling
• Advanced Reservation (before QoPS)
  – Before QoPS, the only way to guarantee a turnaround time
    • Execution time window statically decided upfront
  – Resources underutilized due to fragmentation
  – If resources are available early, the job can't be rescheduled
• Primary Goals of QoPS:
  – Provide admission control
    • When a new job arrives:
      – Reorder existing jobs to find feasible schedules
      – Select the best feasible schedule
  – Ensure deadline guarantees for the accepted jobs
    • A later arriving job cannot force an existing job to miss its deadline!
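The admission-control idea above can be illustrated with a deliberately simplified Python sketch. It is not the QoPS algorithm. It tries a single reordering (earliest deadline first) and places jobs without backfilling, whereas QoPS searches among feasible schedules. The `Job` record, `meets_all_deadlines`, and `admit` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job request (times in hours from now)."""
    name: str
    processors: int
    runtime: float    # user-estimated runtime
    deadline: float   # absolute deadline


def meets_all_deadlines(order, total_processors: int) -> bool:
    """Place jobs in the given order; each starts once enough processors are free."""
    free_at = [0.0] * total_processors   # time at which each processor becomes free
    for job in order:
        if job.processors > total_processors:
            return False
        free_at.sort()
        # Earliest time at which `job.processors` processors are free together.
        start = free_at[job.processors - 1]
        finish = start + job.runtime
        if finish > job.deadline:
            return False
        for i in range(job.processors):
            free_at[i] = finish
    return True


def admit(new_job: Job, accepted, total_processors: int):
    """Return the new job order if every deadline still holds, else None."""
    # Assumption: only the earliest-deadline-first order is tried.
    candidate = sorted(accepted + [new_job], key=lambda j: j.deadline)
    if meets_all_deadlines(candidate, total_processors):
        return candidate
    return None   # reject; already accepted jobs keep their guarantees


j1 = Job("J1", 2, 1.0, 1.0)
j2 = Job("J2", 5, 1.0, 3.0)
j3 = Job("J3", 4, 1.0, 3.0)
print([j.name for j in admit(j3, [j1, j2], total_processors=6)])
```

Because a new job is admitted only when the whole candidate order is feasible, a later arrival can never push an accepted job past its deadline in this sketch.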
Supercomputer Cost Model
• Most supercomputer centers today do not provide QoS
  – Jobs are scheduled in a best-effort manner
  – Thus, no special cost models for QoS either
• Some supercomputers provide prioritization (e.g., NERSC)
  – Different queues of jobs exist
  – More expensive queues get higher priority
• For QoS-driven supercomputers, a new model is required
  – Provider-centric: Supercomputer center determines the charge
  – User-centric: User offers the price / bid
Market-based User-centric Cost Model
• User offers a price to the system
  – Market-based bidding system
  – Proposed by Culler and Chase
• Price offered reduces with time (decay factor)
• Offered price touches zero at the job deadline time
[Figure: Revenue versus Time, with Maximum Revenue and Deadline marked.]
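A decaying price of this kind might look like the following Python sketch. The slide only states that the price decays and reaches zero at the deadline. The linear shape, the choice to start decaying at the arrival time, and the name `offered_price` are assumptions made for illustration.

```python
def offered_price(max_revenue: float, arrival: float,
                  deadline: float, completion: float) -> float:
    """Price paid for a job that completes at `completion` (all times in hours).

    Assumption: the price falls linearly from `max_revenue` at the arrival
    time to zero at the deadline.
    """
    if completion >= deadline:
        return 0.0
    if completion <= arrival:
        return max_revenue
    remaining_fraction = (deadline - completion) / (deadline - arrival)
    return max_revenue * remaining_fraction


# A $500 job submitted at t=0 with a deadline at t=10, finished at t=5.
print(offered_price(500.0, arrival=0.0, deadline=10.0, completion=5.0))
```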
Value-aware QoPS (VQoPS)
• Job acceptance based on two criteria:
  – The deadline should be achievable (evaluated using QoPS)
  – The job should provide enough revenue to offset a statically assumed opportunity cost
    • Product of a fixed opportunity cost factor (OC-Factor) and the size of the job (i.e., number of processor-hours requested)
    • Large jobs (more nodes or long running) have a higher opportunity cost since they can potentially impact more later arriving jobs
• The OC-Factor has to be tuned by the system administrator based on the expected workload!
  – Complicated to evaluate
  – Difficult to adapt if workload changes
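The two acceptance criteria can be written down as a small Python sketch. The function names are hypothetical. The deadline check is passed in as a boolean because it comes from QoPS. Treating the OC-Factor as dollars per processor-hour and using `>=` for "enough revenue to offset" are assumptions.

```python
def static_opportunity_cost(oc_factor: float, processors: int, hours: float) -> float:
    # Static opportunity cost = OC-Factor x job size in processor-hours.
    return oc_factor * processors * hours


def vqops_accepts(revenue: float, processors: int, hours: float,
                  oc_factor: float, deadline_feasible: bool) -> bool:
    """Accept only if the deadline is achievable and revenue offsets the cost."""
    cost = static_opportunity_cost(oc_factor, processors, hours)
    return deadline_feasible and revenue >= cost


# With an OC-Factor of 2.0, an 8 processor-hour job must bring in at least $16.
print(vqops_accepts(10.0, 4, 2.0, oc_factor=2.0, deadline_feasible=True))    # $10 job
print(vqops_accepts(500.0, 4, 2.0, oc_factor=2.0, deadline_feasible=True))   # $500 job
```

With an OC-Factor of 0 the revenue test always passes, so acceptance depends on the deadline check alone.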
VQoPS: An Example Scenario
[Figure: scheduling grid with axes Time and Processors, showing jobs J1, J2 and J3, the labels Current Time and Running Jobs, job J4 ($10) with deadline D4, job J5 ($500) with deadline D5, and the annotation "Less than static opportunity cost (C)".]
By not scheduling J4, we retained the future opportunity to schedule the more expensive job J5
Choosing the right OC-Factor is important for the scheme to be effective
VQoPS performance for different traces

Relative       Urgent     Offered     OC-Factor
Urgency Cost   Jobs (%)   Load        0.00    0.05    0.1     0.2     0.4
10X            80%        Original    21%     26%     37%     37%     39%
5X             80%        Original    20%     25%     34%     35%     30%
2X             80%        Original    19%     26%     27%     -47%    -100%
10X            80%        Original    21%     26%     37%     37%     39%
10X            50%        Original    23%     34%     46%     45%     45%
10X            20%        Original    26%     38%     22%     22%     22%
10X            80%        Original    21%     26%     37%     37%     39%
10X            80%        High        63%     90%     135%    144%    160%
• No single static OC-Factor is best for all cases.
• Best OC-Factor is dependent on trace characteristics.
Dynamic “Self-learning” Value-aware QoPS
• Estimate OC-Factor dynamically for best revenue gain
• OC-Factor depends on
  – System Load
  – Relative frequency of urgent jobs
  – Relative price of urgent jobs
• DVQoPS uses a history-based adaptive technique that accounts for all of these factors
  – Perform a what-if simulation by rolling back and find the best OC-Factor
What-if Simulations in DVQoPS
[Figure: timeline of successive rollback intervals. In each interval, what-if simulations evaluate the candidate OC-Factors O1, O2, O3, ..., ON.]
• O3 gave us the best revenue: pick O3 (OC Factor = O3)
• O2 gave us the best revenue: pick O2 (OC Factor = O2)
We dynamically pick the OC-Factor that gave the best revenue in the previous roll-back interval
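The selection step can be sketched in Python as a replay of the previous interval under each candidate OC-Factor. The replay model is a toy. Jobs are (price, processor-hours) pairs, capacity is a single processor-hour budget, and deadlines are ignored. `replay_revenue` and `pick_oc_factor` are hypothetical names, not DVQoPS's actual simulator.

```python
def replay_revenue(history, oc_factor: float, capacity: float) -> float:
    """Replay (price, processor_hours) submissions in arrival order."""
    revenue = 0.0
    remaining = capacity   # processor-hours still free in the interval
    for price, proc_hours in history:
        offsets_cost = price >= oc_factor * proc_hours
        if offsets_cost and proc_hours <= remaining:
            remaining -= proc_hours
            revenue += price
    return revenue


def pick_oc_factor(history, candidates, capacity: float) -> float:
    """Return the candidate with the best what-if revenue (first one on ties)."""
    return max(candidates, key=lambda oc: replay_revenue(history, oc, capacity))


# The J4 ($10) and J5 ($500) scenario: both need 8 of the 10 free processor-hours.
history = [(10.0, 8.0), (500.0, 8.0)]
print(pick_oc_factor(history, candidates=[0.0, 2.0, 100.0], capacity=10.0))
```

In this toy replay, a factor of 0 admits the $10 job and then has no room for the $500 job, while a factor that is too high rejects both; the middle candidate yields the most revenue.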
Impact of Rollback Window Size
• Balancing Sensitivity and Stability
  – Sensitivity: Too long a rollback window loses sensitivity to small changes in the workload
  – Stability: Too short a rollback window loses stability and causes the results to be noisy
• Need to calculate rollback window dynamically

Rollback       Average Instability   Load Variance   Revenue
Window Size    in OC-Factor          Sensitivity
4              6.18                  2.89            508341077
32             2.99                  0.34            692266945
48             1.36                  0.24            715606095
128            1.13                  0.04            701476009
Simulation Setup
• Two categories of jobs
  – Urgent Jobs
  – Normal Jobs
• Job Mixes (Urgent, Normal):
  – (80%, 20%), (50%, 50%), (20%, 80%)
• Urgency factor:
  – Urgent Job Revenue = URG_FACT x Normal Job Revenue
  – URG_FACT values used: 10, 5, 2
  – URG_FACT refers to the height and steepness of the cost model curve
Impact of Job Mix (% of Urgent Jobs)
[Figure: two charts, "Revenue Improvement (normal load)" and "Revenue Improvement (high load)". X-axis: % Urgent Jobs (80%, 50%, 20%). Y-axis: Percentage Improvement. Series: VQoPS-0.05, VQoPS-0.1, VQoPS-0.2, VQoPS-0.4, DVQoPS.]
DVQoPS performs within 2-3% of the best VQoPS implementation
Service Differentiation and Job Urgency
[Figure, panel "Service Differentiation": Accepted Load (Urgent, Normal, Overall) for QoPS, VQoPS-0.05, VQoPS-0.1 and DVQoPS.]
[Figure, panel "Job Urgency": Revenue Improvement versus Job Urgency Factor (10X, 5X, 2X) for VQoPS-0.05, VQoPS-0.1, VQoPS-0.4 and DVQoPS.]
DVQoPS provides an appropriate amount of service differentiation depending on the cost difference
As job urgency increases, VQoPS with higher OC-Factors performs better; DVQoPS automatically adjusts itself
Impact of Inaccurate User Estimates
[Figure: Revenue Improvement versus Percentage of Urgent Jobs (80%, 50%, 20%) for VQoPS-0.05, VQoPS-0.1, VQoPS-0.2 and DVQoPS.]
• Overall improvement in revenue drops considerably
  – Inaccurate estimates result in a lot of wastage due to strict provisioning
• DVQoPS still performs within 2% of the best VQoPS implementation
• 15% better than QoPS
Concluding Remarks and Future Work
• QoS in Scheduling is a new concept with growing interest
  – Schemes such as QoPS (our previous work) that provide deadlines exist, but they do not deal with system revenue
• In this paper, we analyzed the behavior of systems when a cost model is introduced
  – System dynamism adds a new parameter, “Opportunity Cost”, which makes the issue unpredictable
  – We presented two schemes, VQoPS and DVQoPS, which analyze opportunity cost and minimize its impact
  – Simulations show up to 200% better performance in some cases
• Future Work: Integrating QoS and prioritization and incorporating the code into standard schedulers
Thank You!
Contacts:
M. Islam: [email protected]
P. Balaji: [email protected]
G. Sabin: [email protected]
P. Sadayappan: [email protected]
Web pointers:
http://www.mcs.anl.gov/~balaji
Backup slides
QoPS: An Example Scenario
[Figure: animation frames of QoPS placing a new job JN into the existing job order J6 J5 J4 J3 J2 J1, trying successive positions while tracking CURRENT_VIOLATION (0, then 1) against MAX_ALLOWED_VIOLATION = 2.]
Rollback Interval
• Effective rollback interval is estimated every MAX_ROLLBACK_INTERVAL (e.g., 128 hr)
• MaxRevenue = Revenue (currentSchedule)
• For each testInterval in {1 hr, 4 hr, 16 hr, 64 hr, 128 hr}
  – Run what-if simulation by rolling back testInterval
  – Revenue = Calculate revenue of the schedule
  – If Revenue > MaxRevenue
    • MaxRevenue = Revenue
    • Effective Rollback Interval = testInterval
• End for
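The loop on this slide translates almost directly into Python. In the sketch below the what-if simulator is a caller-supplied function, `what_if_revenue`, which is a hypothetical stand-in. `previous_interval` is an assumed parameter that carries over the last effective interval when no test interval beats the current schedule.

```python
from typing import Callable, Iterable, Optional


def effective_rollback_interval(
    current_revenue: float,
    what_if_revenue: Callable[[int], float],
    test_intervals: Iterable[int] = (1, 4, 16, 64, 128),
    previous_interval: Optional[int] = None,
) -> Optional[int]:
    """Pick the rollback interval (hours) whose what-if schedule earns the most."""
    max_revenue = current_revenue          # MaxRevenue = Revenue(currentSchedule)
    best = previous_interval
    for interval in test_intervals:
        revenue = what_if_revenue(interval)   # roll back `interval` hours and re-simulate
        if revenue > max_revenue:
            max_revenue = revenue
            best = interval
    return best


# Stand-in simulator: a lookup table of made-up what-if revenues.
made_up = {1: 90.0, 4: 120.0, 16: 150.0, 64: 140.0, 128: 110.0}
print(effective_rollback_interval(100.0, made_up.__getitem__))
```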