Algorithms and Scheduling Techniques to Manage Resilience and
Power Consumption in Distributed Systems
Andrei Tchernykh, CICESE Research Center,
Ensenada, Baja California, México
http://usuario.cicese.mx/~chernykh/
Dagstuhl – July 7, 2015
From Static Scheduling Towards Understanding Uncertainty
Baja California, México
Sonora
Yaqui deer dancer
Research Areas
HPC
Cloud Computing
Scheduling
Resource optimization
Online, offline
Real Time Systems
Grid Computing
Knowledge Free
Scheduling with Uncertainty
Multiobjective Optimization
Computational Intelligence
List Scheduling, Stealing
Scheduling with Service Levels
Approximation Algorithms
Workflow Scheduling
Collaboration
Mexico
USA
Uruguay
Spain
France
Russia
Luxembourg
Germany
Universidad Autónoma de Baja California
Universidad Autónoma de Nuevo León
Tecnológico de Monterrey
Instituto Tecnológico de Morelia
Centro de Estudios Superiores del Estado de Sonora
Dortmund University: Prof. Uwe Schwiegelshohn
University of Göttingen: Prof. Ramin Yahyapour
Institute for System Programming, RAS: Prof. Arutyun Avetisyan, Prof. Nikolay Kuzurin
Moscow Institute of Physics and Technology: Prof. Alexander Drozdov
Institute of Informatics and Applied Mathematics of Grenoble: Prof. Denis Trystram
INRIA Lille - Nord Europe: Prof. El-ghazali Talbi
University of Notre Dame: Dr. Jarek Nabrzyski
University of California, Irvine, CA, USA: Prof. Isaac Scherson, Prof. Jean Luc Gaudiot
University of Luxembourg: Prof. Pascal Bouvry, Dr. Dzmitry Kliazovich
Universidad de la República: Dr. Sergio Nesmachnow
BSC: Prof. Vassil Alexandrov
CICESE Parallel Computing Laboratory
Team
Towards Understanding Uncertainty in
Cloud Computing Resource
Provisioning
Andrei Tchernykh CICESE Research Center, Mexico
Uwe Schwiegelshohn University of Dortmund, Germany
El-ghazali Talbi University of Lille, France
Vassil Alexandrov Barcelona Supercomputing Centre, Spain
ICCS-SPU 2015. Procedia Computer Science, Elsevier, 2015
Uncertainty
Uncertainty can be classified in several ways according to its nature:
1. Long-term uncertainty arises because the object is poorly understood and inadvertent factors can influence its behavior.
2. Retrospective uncertainty is due to the lack of information about the behavior of the object in the past.
3. Technical uncertainty is a consequence of the impossibility of predicting the exact results of decisions.
4. Stochastic uncertainty is a result of the probabilistic (stochastic) nature of the studied processes and phenomena. Three cases are distinguished:
• reliable statistical information is available;
• statistical information is not available;
• the hypothesis of a stochastic nature requires verification.
Tychinsky 2006
Uncertainty
5. Constraint uncertainty: partial or complete ignorance of the conditions.
6. Participant uncertainty: conflict between the main stakeholders (cloud providers, users, and administrators).
• each party has its own preferences and incomplete, inaccurate information about the motives and behavior of the opposing parties.
7. Goal uncertainty
• inability to select one goal
• conflicts in building a multi-objective optimization model
• competing interests
8. Condition uncertainty occurs when there is a failure or a complete lack of information about the conditions under which decisions are made.
Uncertainty
9. Action uncertainty occurs when there is ambiguity in choosing solutions.
• In the single-objective case,
o the task is to determine the best solution among all feasible ones.
• In the multiple-objective case,
o there exists a (possibly infinite) number of Pareto optimal solutions;
o the problem is to find a good element of this set.
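The multiple-objective case can be illustrated with a small sketch. It assumes two objectives that are both minimized and a hypothetical list of candidate solutions; the function names are illustrative and not taken from the talk.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one.

    Assumption: every objective is minimized.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (cost, response time) pairs for five candidate schedules.
solutions = [(1.0, 9.0), (2.0, 4.0), (3.0, 5.0), (4.0, 1.0), (5.0, 3.0)]
front = pareto_front(solutions)
print(front)
```

Three of the five candidates remain after filtering, and none of them is better than the others in both objectives. Choosing one of them is the remaining decision problem.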
Uncertainty
Uncertainties can be grouped into:
Parameter (parametric) uncertainties
1. arise from incomplete knowledge and variation of the parameters
2. are estimated using statistical techniques
System uncertainties
1. arise from an incomplete understanding of the processes that control service provisioning
2. arise from incomplete information about a system
Uncertainty in Clouds
Services and resources are subject to considerable uncertainty during provisioning.
Uncertainty brings additional challenges to
• End-users
• Resource providers
• Brokering
It requires
• giving up habitual computing paradigms
• adapting current computing models
• designing novel resource management strategies to handle uncertainty in an effective way
The question is:
How to deliver scalable and robust cloud behavior under uncertainties and specific constraints, such as budgets, QoS, SLA, and energy costs?
Sources of uncertainty
• dynamic elasticity
• dynamic performance changes
• virtualization, loose coupling of applications to the infrastructure
• variation in resource provisioning time
• inaccuracy of application runtime estimates, variation of processing times
• variation in data transmission, variable data streams
• release time and workload uncertainty
• effective bandwidth variation
and other phenomena.
• the workload is not predictable and can change dramatically
• performance can change due to sharing of common resources with other VMs
Sources of uncertainty
Providers might not know the
• quantity of transmitted data
• amount of computation
Example:
Every time a user requests the status of an e-mail or bank account, the request can
• generate a different amount of data and
• take a different time to deliver.
Sources of uncertainty
It is impossible to get exact knowledge about the system.
Parameters such as
• effective processor speed,
• number of available processors,
• actual bandwidth
change over time.
The topology is unknown.
In general, the execution environment will differ for each program/service invocation.
Sources of uncertainty
[Table: cloud computing parameters (rows) against sources of uncertainty (columns); a ● marks each source that affects a parameter.]
Sources of uncertainty (columns): Data (volume, variety, value); Virtualization; Jobs arrival; Migration; Energy minimization; Fault tolerance; Scalability; Cost (dynamic pricing); Resource availability; Elasticity; Consolidation; Communication; Replication; Cloud infrastructure; Resource provisioning time
Cloud computing parameters (rows), with the number of sources marked for each:
Effective performance: 12
Effective bandwidth: 12
Processing time: 12
Available memory: 12
Number of processors: 10
Available storage: 10
Data transfer time: 6
Resource capacity: 8
Network capacity: 7
Approaches
To treat uncertainty and dynamism we need sophisticated
solutions.
• Fuzzy,
• Robust,
• Non-clairvoyant
• Knowledge-free
• Stochastic
• Randomized algorithms
• Dynamic priority
• Adaptive strategies (reactive)
• Dynamic load balancing
Preliminary results
Scheduling for Cloud Computing with
Different Service Levels
IPDPS 2012, IEEE 26th International Parallel and Distributed Processing Symposium
Uwe Schwiegelshohn University of Dortmund, Germany
Andrei Tchernykh CICESE Research Center, Mexico
Quality of Service
• Quality of service: response time in relation to the requested processing time
• Deadline: service level (slack factor) and execution time
• Profit: price per time unit
• Competitive factor: obtained income / optimal income
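The quantities on this slide can be tied together in a short sketch. It rests on assumptions that the slide does not state: the deadline is taken as release time plus slack factor times processing time, and income is the price per time unit multiplied by the processing time of the accepted jobs. The job data and function names are hypothetical.

```python
from typing import List


def deadline(release: float, slack_factor: float, processing_time: float) -> float:
    """Assumed deadline rule: release time plus slack factor times processing time."""
    return release + slack_factor * processing_time


def income(processing_times: List[float], price_per_time_unit: float) -> float:
    """Assumed income model: price per time unit times total accepted processing time."""
    return price_per_time_unit * sum(processing_times)


def competitive_factor(obtained_income: float, optimal_income: float) -> float:
    """Ratio of the income obtained online to the optimal income."""
    if optimal_income <= 0:
        raise ValueError("optimal income must be positive")
    return obtained_income / optimal_income


# Hypothetical instance: the online policy accepts two jobs, the optimum accepts three.
accepted_online = [4.0, 2.0]
accepted_optimal = [4.0, 2.0, 6.0]
price = 0.5

factor = competitive_factor(income(accepted_online, price),
                            income(accepted_optimal, price))
print(deadline(10.0, 2.0, 5.0), factor)
```

A factor close to 1 means the online policy earns almost as much as an optimal schedule that knows all jobs in advance.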
Competitive Factor
SSL-SM
$$\rho \le 1 - \left(1 - \frac{p_{\min}}{p_{\max}}\right)\frac{1}{f}$$
SSL-MM
$$\rho \le \frac{f}{1 + f\left(1 - \frac{p_{\min}}{p_{\max}}\right)}$$
Das Gupta and Palis, 2001
Schwiegelshohn, Tchernykh 2012
Competitive Factor
𝝆 ≤ 𝒎𝒂𝒙{
𝒑𝒎𝒊𝒏𝒑𝒎𝒂𝒙
𝒇𝑰 − 𝟏,𝒇𝑰 − 𝟏 +
𝒑𝒎𝒊𝒏𝒑𝒎𝒂𝒙
𝒇𝑰 − 𝟏 +𝒖𝑰𝒖𝑰𝑰
MSL-SM
MSL-MM 𝝆 ≤𝒖𝑰𝑰𝒖𝑰
(𝟏 −𝟏
𝒇𝑰)
Schwiegelshohn, Tchernykh 2012
On-line Scheduling in Distributed Systems
Multiple strip packing
Job Stealing, non-clairvoyant
Uwe Schwiegelshohn University of Dortmund, Germany
Andrei Tchernykh CICESE Research Center, Mexico
Ramin Yahyapour University of Göttingen, Germany
IEEE IPDPS 2008
Grid Scheduling Algorithm
Each machine applies a priority order when selecting jobs for execution:
1. Jobs of its group A
2. Jobs of its group B
3. Jobs that are enabled for execution on its previous machine.
Uwe Schwiegelshohn
Performance of the Algorithm
• Theoretical evaluation
– Cmax(LIST)/Cmax* < 3 in the offline case
– Cmax(LIST)/Cmax* < 5 in the online case
IEEE IPDPS, 2008
Improved by … (Klaus Jansen, Denis Trystram et al.): 5/2, 7/3, 2 + ε, 2-approximations
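The ratio Cmax(LIST)/Cmax* compares the makespan of list scheduling with the optimal makespan. The sketch below illustrates that comparison in a deliberately simpler setting than the grid model of the talk: sequential jobs on identical machines, with a standard lower bound standing in for Cmax*. The job data are hypothetical.

```python
import heapq
from typing import List


def list_schedule_makespan(processing_times: List[float], machines: int) -> float:
    """Greedy list scheduling: each job goes to the machine that becomes free first."""
    loads = [0.0] * machines
    heapq.heapify(loads)
    for p in processing_times:
        earliest = heapq.heappop(loads)       # machine with the smallest current load
        heapq.heappush(loads, earliest + p)   # run the job there
    return max(loads)


def makespan_lower_bound(processing_times: List[float], machines: int) -> float:
    """No schedule can beat the longest job or the average load per machine."""
    return max(max(processing_times), sum(processing_times) / machines)


# Hypothetical instance: four short jobs followed by one long job, two machines.
jobs = [1.0, 1.0, 1.0, 1.0, 4.0]
cmax_list = list_schedule_makespan(jobs, machines=2)
lower_bound = makespan_lower_bound(jobs, machines=2)
print(cmax_list, lower_bound, cmax_list / lower_bound)
```

On this instance the long job arrives last, so the greedy schedule finishes later than the lower bound. The bounds quoted on the slide cover the harder case of parallel jobs and machines of different sizes.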
On-line Scheduling in Distributed Systems
Multiple strip packing
Adaptive Admissible Allocation
Future Generation Computer Systems 2012
Journal of Scheduling, 2010
Andrei Tchernykh, José Luis González-García, Vanessa Miranda-López (CICESE Research Center, Mexico)
Uwe Schwiegelshohn (University of Dortmund, Germany)
Ramin Yahyapour (University of Göttingen, Germany)
Allocation
[Figure: machines m_1, m_2, m_3, m_4, m_5, …, m_m. For job J_j, first(J_j) = 2. The M-available set ends at last(J_j) = m; the M-admissible set ends at last(J_j) = 5.]
last(J_j) is the minimum r such that
$$\sum_{i=first(J_j)}^{r} m_i \ge a \sum_{i=first(J_j)}^{m} m_i$$
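The admissible range can be computed directly from the machine sizes. The sketch below assumes that machines are indexed in non-decreasing order of size, that first(J_j) is the first machine large enough for the job, and that last(J_j) is the smallest index r at which the processors of machines first..r reach the fraction a of the processors of machines first..m. The machine sizes and function names are hypothetical.

```python
from typing import List


def first_machine(job_size: int, machine_sizes: List[int]) -> int:
    """1-based index of the first machine with at least job_size processors.

    Assumption: machine_sizes is sorted in non-decreasing order.
    """
    for index, size in enumerate(machine_sizes, start=1):
        if size >= job_size:
            return index
    raise ValueError("job does not fit on any machine")


def last_admissible(first: int, machine_sizes: List[int], a: float) -> int:
    """Smallest r such that sum(m_first..m_r) >= a * sum(m_first..m_m)."""
    total = sum(machine_sizes[first - 1:])
    partial = 0
    for r in range(first, len(machine_sizes) + 1):
        partial += machine_sizes[r - 1]
        if partial >= a * total:
            return r
    return len(machine_sizes)


# Hypothetical grid with six machines and a job that needs 3 processors.
sizes = [2, 4, 4, 8, 8, 16]
first = first_machine(3, sizes)
last_half = last_admissible(first, sizes, a=0.5)
last_full = last_admissible(first, sizes, a=1.0)
print(first, last_half, last_full)
```

With a = 1 every available machine is admissible. A smaller a keeps the job on the smaller machines and leaves the larger ones free for jobs that need them.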
Adaptive optimization
For a set of machines with identical processors, and for a set of rigid jobs with admissible range a between 0 and 1
Approximation factor (off-line), Min_LB-a + Best_PS
[Equation: piecewise bound in terms of a]
Competitive factor (on-line), Min_LB-a + Best_PS
[Equation: piecewise bound in terms of a]
Tchernykh et al. 2012, Future Generation Computer Systems, Elsevier
Tchernykh et al. 2010, Journal of Scheduling, Springer
List Scheduling in the Grid
[Figure: schedule over time on machines with different numbers of processors, a = 1; Cmax(LIST) = 4]
Admissible Allocation in Grid
[Figure: schedule over time on machines with different numbers of processors, a = 0.5 compared with a = 1; Cmax(LIST) = 2]
Theoretical Evaluation
Grid scheduling
On-line
• Non-clairvoyant
• Different machine sizes
Off-line
• Non-clairvoyant
• Clairvoyant
• Equal machine sizes
• Different machine sizes
(Schwiegelshohn et al. 2008) 3-approximation
(Pascual et al. 2008) 4-approximation
(Klaus Jansen, Denis Trystram) 5/2, 7/3, 2 + ε, 2-approximation
(Zhuk et al. 2004) 10-approximation
(Tchernykh et al. 2005) 10-approximation
(Tchernykh et al. 2012) 3-approximation
(Tchernykh et al. 2008) 5-competitive
(Tchernykh et al. 2010) 17-competitive
(Schwiegelshohn 2010) (2e+1)-competitive
(Tchernykh et al. 2012) 5-competitive
• Future Generation Computer Systems, Elsevier
• Journal of Scheduling, Springer
• Discrete Applied Mathematics, Elsevier
• Tran Fund Elec, Comm. & Comp. Science, IEICE
• Parallel and Distributed Processing, IEEE
• Computers & Industrial Engineering, Elsevier
Job Allocation Strategies with User Run
Time Estimates
Journal of Grid Computing, Springer, 2011
Juan Manuel Ramírez, Andrei Tchernykh, José Luis González, Adán Hirales-Carbajal (CICESE Research Center, Mexico)
Uwe Schwiegelshohn (University of Dortmund, Germany)
Ramin Yahyapour (University of Göttingen, Germany)
Multiple Workflow Scheduling
Strategies
with User Run Time Estimates
on a Grid
Journal of Grid Computing, Springer, 2012
Adán Hirales-Carbajal, Andrei Tchernykh, José Luis González, Juan Manuel Ramírez (CICESE Research Center, Mexico)
Thomas Röblitz (University of Dortmund, Germany)
Ramin Yahyapour (University of Göttingen, Germany)
Adaptive Resource Allocation
in Computational Grids with
Runtime Uncertainty
Andrei Tchernykh (CICESE Research Center)
Raul Ramírez-Velarde, Carlos Barba-Jimenez, Juan Nolazco (Tecnológico de Monterrey, México)
Adán Hirales-Carbajal (CETYS University, Mexico)
The model uses the notions of
• heavy tails
• self-similarity
for the predictability of run-time estimates
Energy-Aware Online Scheduling:
Ensuring Quality of Service for
IaaS Clouds
Andrei Tchernykh, Luz Lozano (CICESE Research Center, Mexico)
Uwe Schwiegelshohn (University of Dortmund, Germany)
Johnatan Pecero, Pascal Bouvry (University of Luxembourg, Luxembourg)
Sergio Nesmachnow (Universidad de la República, Uruguay)
Alexander Yu. Drozdov (Moscow Institute of Physics and Technology)
Journal of Grid Computing, Springer, 2015
Solution space, Pareto optimal solutions
[Figure: solution space with Pareto optimal solutions; x-axis: power consumption degradation (0 to 9), y-axis: income degradation (0 to 1.8)]
Adaptive energy efficient scheduling
in Peer-to-Peer desktop grids
Knowledge Free Scheduling
Future Generation Computer Systems, 2013
Andrei Tchernykh, Aritz Barrondo (CICESE Research Center, Mexico)
Johnatan E. Pecero (University of Luxembourg, Luxembourg)
Elisa Schaeffer (Universidad Autónoma de Nuevo León, Mexico)
Work Queue with Replication (WQR)
Time
Resources
Time+Resources
OurGrid, BOINC
SETI@home, folding@home, Rosetta@home
Einstein@home,
+50 projects
A VoIP Service for Cloud
Infrastructure
Andrei Tchernykh, Jorge Mario Cortez (CICESE Research Center, Mexico)
Johnatan E. Pecero, Pascal Bouvry, Ana-Maria Simionovici, Dzmitry Kliazovich (University of Luxembourg, Luxembourg)
Loic Didelot (MIXvoip S.A., Luxembourg)
Denis Trystram (Grenoble Institute of Technology, France)
Problem
Two objectives:
- Provider cost optimization
- Voice quality
Bin-packing approach (well-known)
• one-dimensional, on-line
• classic NP-hard optimization problem
The principal novelty
• the state of a bin is determined not only by the actions of the decision maker during item allocations,
• but also by item completions after their lifespan.
Unlike in the standard formulation,
• bins are always open
• the problem is dynamic
• items in bins can be terminated (call termination)
• utilization can change
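The difference from standard bin packing can be sketched with a First Fit rule in which bins never close and items leave when a call ends. First Fit is used here only as a familiar stand-in; the talk does not say which allocation rule the authors use. The class name, capacities, and loads are hypothetical.

```python
from typing import Dict, List


class DynamicFirstFit:
    """First Fit over bins that stay open; items can be removed at any time."""

    def __init__(self, capacity: float) -> None:
        self.capacity = capacity
        self.bins: List[Dict[str, float]] = []  # each bin maps call id -> load

    def place(self, call_id: str, load: float) -> int:
        """Put the call into the first bin with enough free capacity."""
        if load > self.capacity:
            raise ValueError("call does not fit into an empty bin")
        for index, contents in enumerate(self.bins):
            if sum(contents.values()) + load <= self.capacity:
                contents[call_id] = load
                return index
        self.bins.append({call_id: load})
        return len(self.bins) - 1

    def terminate(self, call_id: str) -> None:
        """Remove a finished call; the freed capacity can be reused later."""
        for contents in self.bins:
            if call_id in contents:
                del contents[call_id]
                return
        raise KeyError(call_id)

    def utilization(self) -> List[float]:
        return [sum(contents.values()) / self.capacity for contents in self.bins]


packer = DynamicFirstFit(capacity=10)
bin_a = packer.place("call-a", 6)
bin_b = packer.place("call-b", 5)   # does not fit next to call-a
packer.terminate("call-a")          # call-a ends and frees its bin
bin_c = packer.place("call-c", 7)   # reuses the freed bin
print(bin_a, bin_b, bin_c, packer.utilization())
```

Because a termination frees capacity in a bin that is still open, a later call can land in an earlier bin, which cannot happen in the classic formulation where items never leave.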
Cloud Infrastructure Cost
Optimization:
to buy or to lease
Uwe Schwiegelshohn, Stephan Schlagkamp (University of Dortmund, Germany)
Andrei Tchernykh, Fermin Armenta (CICESE Research Center, Mexico)
Cloud Provider Cost
Our objective is to:
• avoid overprovisioning
• find the resource capacity of the private cloud
• minimize total investment and leasing costs with respect to the
demand forecast
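The trade-off between buying and leasing can be sketched with a toy cost model. Everything in it is an assumption made for illustration: demand is forecast per period in resource units, owned capacity has a fixed cost per unit over the whole horizon, and any demand above the owned capacity is leased at a per-unit, per-period price.

```python
from typing import List, Tuple


def total_cost(capacity: int, demand: List[int],
               buy_cost: float, lease_cost: float) -> float:
    """Investment for owned capacity plus leasing cost for demand above it."""
    investment = capacity * buy_cost
    leasing = sum(max(0, d - capacity) for d in demand) * lease_cost
    return investment + leasing


def best_private_capacity(demand: List[int], buy_cost: float,
                          lease_cost: float) -> Tuple[int, float]:
    """Try every capacity from 0 up to the peak demand and keep the cheapest."""
    candidates = range(0, max(demand) + 1)
    best = min(candidates,
               key=lambda c: total_cost(c, demand, buy_cost, lease_cost))
    return best, total_cost(best, demand, buy_cost, lease_cost)


# Hypothetical demand forecast for five periods and hypothetical prices.
forecast = [2, 5, 3, 8, 4]
capacity, cost = best_private_capacity(forecast, buy_cost=5.0, lease_cost=2.0)
print(capacity, cost)
```

The chosen capacity sits below the peak demand: owning enough for the peak would be overprovisioning, while owning too little shifts too much demand to leasing.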
Modeling applications with
communications and uncertainty
Dzmitry Kliazovich, Johnatan E. Pecero, Pascal Bouvry (University of Luxembourg, Luxembourg)
Andrei Tchernykh (CICESE Research Center, Mexico)
Samee U. Khan (North Dakota State University, U.S.A.)
Albert Y. Zomaya (University of Sydney, Australia)
• IEEE CLOUD 2013 - IEEE 6th International Conference on Cloud Computing
• Journal of Grid Computing, Springer, 2015
Modeling Applications
• Proposed CA-DAG: Communication-Aware DAG model
– Two types of vertices: one for computing and one for communications
– Edges define dependences between tasks and order of execution
• Main advantage
– Allows separate resource allocation decisions, assigning processors to handle computing jobs and network resources for information transmissions
[Figure: example CA-DAG with tasks 1, 2, 3, and 4; legend: communication task, computing task, ordinary edge]
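A graph with two vertex types can be sketched as follows. The vertex names and the shape of the graph are hypothetical and do not reproduce the figure; the sketch only shows how typed vertices allow computing and communication tasks to be handed to different resources. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each vertex is either a computing task or a communication task.
kinds = {
    "t1": "computing",
    "c1": "communication",  # carries the output of t1 to its successors
    "t2": "computing",
    "t3": "computing",
    "t4": "computing",
}

# Edges as "vertex -> set of predecessors": they fix the order of execution.
predecessors = {
    "c1": {"t1"},
    "t2": {"c1"},
    "t3": {"c1"},
    "t4": {"t2", "t3"},
}

# Any topological order respects the dependences between tasks.
order = list(TopologicalSorter(predecessors).static_order())

# Separate allocation decisions: processors for computing, network for communication.
processor_queue = [v for v in order if kinds[v] == "computing"]
network_queue = [v for v in order if kinds[v] == "communication"]
print(order, processor_queue, network_queue)
```

Keeping communication as its own vertex type lets a scheduler assign network resources to a transfer independently of the processors chosen for the computing tasks around it.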
Thanks for your attention!
Redmer Hoekstra