on-line, non-clairvoyant optimization of workflow activity granularity task on grids

On-line, Non-Clairvoyant Optimization of Workflow Activity Granularity on Grids. Rafael FERREIRA DA SILVA, Tristan GLATARD (University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France); Frédéric DESPREZ (INRIA, University of Lyon, LIP, ENS Lyon, Lyon, France). Euro-Par 2013, August 26-30, 2013.

Upload: rafael-ferreira-da-silva

Post on 21-Jun-2015


DESCRIPTION

Presentation held at Euro-Par 2013, Aachen, Germany.

Abstract. Controlling the granularity of workflow activities executed on widely distributed computing platforms such as grids is required to reduce the impact of task queuing and data transfer time. Most existing granularity control approaches assume extensive knowledge about the applications and resources (e.g., task duration on each resource), and that both the workload and available resources do not change over time. We propose a granularity control algorithm for platforms where such clairvoyant and offline conditions are not realistic. Our method groups tasks when the fineness degree of the application, which takes into account the ratio of shared data and the queuing/round-trip time ratio, becomes higher than a threshold determined from execution traces. The algorithm also de-groups task groups when new resources arrive. The application's behavior is constantly monitored so that the characteristics useful for the optimization are progressively discovered. Experimental results, obtained with 3 workflow activities deployed on the European Grid Infrastructure, show that (i) the grouping process yields speed-ups of about 2.5 when the amount of available resources is constant and that (ii) the use of de-grouping yields speed-ups of 2 when resources progressively appear.

More information: www.rafaelsilva.com

TRANSCRIPT

Page 1: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

1 Rafael Ferreira da Silva – [email protected]

On-line, Non-Clairvoyant Optimization of Workflow Activity Granularity on Grids

Rafael FERREIRA DA SILVA, Tristan GLATARD University of Lyon, CNRS, INSERM, CREATIS

Villeurbanne, France

Frédéric DESPREZ INRIA, University of Lyon, LIP, ENS Lyon

Lyon, France

Euro-Par 2013 August 26-30, 2013

Page 2: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Outline

  Context
    The Virtual Imaging Platform
  Problem definition
    Task granularity
    Self-healing of workflow executions on grids
  Task granularity control process
  Experiments and results
  Conclusion


Page 4: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Context

  Virtual Imaging Platform (VIP)
    Medical imaging science-gateway
    Grid of ~180 sites (EGI – http://www.egi.eu)
  Significant usage
    452 registered users from 50 countries
    Consumed 472 CPU years from August 2012 to July 2013 (http://dirac.france-grilles.fr)

[Figure: VIP consumption since August 2012]

Page 5: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Workflow Execution

1. Input data upload
2. User launches a simulation
3. MOTEUR generates invocations
4. GASW generates grid jobs
5. Jobs are submitted to DIRAC
6. Pilot jobs are submitted to EGI
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results

Page 6: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Task Granularity

  Low performance of lightweight (a.k.a. fine-grained) tasks:
    High queuing times
    Communication overhead

[Figure: timelines of lightweight tasks t1 to t5 on resources R1 to R3 (axes: time, resources)]

  Lightweight task executions are delayed
  Grouping tasks into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time

Page 7: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Workflow Self-Healing

  Problem: costly manual operations
    Rescheduling tasks, restarting services or replicating data files
  In this work: task granularity in distributed workflows
  Objective: automated platform administration
    Autonomous detection of fine-grained tasks
    Perform an appropriate set of actions
  Assumptions: online and non-clairvoyant
    Only partial information available
    Decisions must be fast
    Production conditions: no prediction of user activity or workloads

Page 8: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

General MAPE-K loop

[Figure: MAPE-K loop with Monitoring, Analysis, Planning, Execution and Knowledge components. Monitoring is triggered by an event (job completion and failures) or a timeout and produces the monitoring data.]

Example shown in the figure:

  Incident degrees, each mapped to level 1, level 2 or level 3
    Incident 1: degree η = 0.8
    Incident 2: degree η = 0.4
    Incident 3: degree η = 0.1
  Roulette wheel selection: incident i is drawn with probability $\eta_i / \sum_{j=1}^{n} \eta_j$; Incident 1 is selected
  Association rules for incident 1

    Rule     Confidence (ρ)   ρ × η
    2 → 1    0.8              0.32
    3 → 1    0.2              0.02
    1 → 1    1.0              0.80

  Roulette wheel selection based on association rules: Incident 2 is selected
  Set of actions

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), in press, 2013.
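The two roulette wheel selections on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `roulette_select` and the dictionary layout are hypothetical, the rules are read as "x → 1" pairs (an interpretation of the flattened table), and the degrees and confidences are the example values from the slide. Each candidate is drawn with probability proportional to its weight: η for incidents, ρ × η for association rules.

```python
import random


def roulette_select(weights, rng=random):
    """Pick one key with probability proportional to its weight.

    `weights` maps a candidate (incident or rule) to a non-negative weight.
    """
    candidates = list(weights)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]


# Example values taken from the slide.
degrees = {1: 0.8, 2: 0.4, 3: 0.1}                      # incident -> degree η
confidences = {(2, 1): 0.8, (3, 1): 0.2, (1, 1): 1.0}   # rule (x -> 1) -> confidence ρ

# Stage 1: select an incident with probability η_i / Σ_j η_j.
incident = roulette_select(degrees)

# Stage 2: weight each rule by ρ × η of its source incident, then select a rule.
rule_weights = {rule: rho * degrees[rule[0]] for rule, rho in confidences.items()}
rule = roulette_select(rule_weights)
```

The computed rule weights reproduce the ρ × η column of the table (0.32, 0.02 and 0.80).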

Page 9: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Incident Levels and Actions

  Incident degrees are quantized into discrete incident levels
  Thresholds are determined from visual mode clustering or K-means
  Thresholds cluster platform configurations into groups

[Figure: regions labelled "No actions are triggered" and "Triggers a set of actions"]
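Mapping a continuous incident degree to a discrete level can be sketched as a threshold lookup. The function names and the threshold values used in the usage note are hypothetical; the slides only state that thresholds come from visual mode clustering or K-means. The sketch assumes that a degree equal to a threshold stays in the lower level, and that only level 1 triggers no action.

```python
from bisect import bisect_left


def incident_level(degree, thresholds):
    """Return the 1-based level of `degree` given ascending thresholds.

    A degree equal to a threshold stays in the lower level (assumption).
    """
    return bisect_left(thresholds, degree) + 1


def triggers_actions(degree, thresholds):
    """Level 1 triggers nothing; any higher level triggers a set of actions."""
    return incident_level(degree, thresholds) > 1
```

For instance, with hypothetical thresholds `[0.3, 0.6]`, a degree of 0.2 falls in level 1 and a degree of 0.7 falls in level 3.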


Page 11: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Fineness control: degree

  Task execution

[Figure: task timeline with phases Queued Time, Shared Input Data, Other Input Data and Application Execution, annotated with $q_j$, $\tilde{t}_{shared}$ and $\tilde{t}$ (median task phase durations)]

  Incident degree

    $\eta_f = \max_{i \in [1,m]} \{ f_i = d_i \cdot r_i \}$

    $d_i = \frac{\tilde{t}_{shared}}{\tilde{t}_{shared} + n_i (\tilde{t} - \tilde{t}_{shared})}$

    $r_i = \frac{\max_{j \in [1,n_i]} q_j}{\max_{j \in [1,n_i]} q_j + \tilde{t}_{shared} + n_i (\tilde{t} - \tilde{t}_{shared})}$

    i = waiting task, n = number of waiting tasks
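The fineness degree can be computed directly from the three formulas on this slide. The sketch below is illustrative only: the function name, argument names and data layout are assumptions, and n_i is read here as the number of tasks attached to waiting item i (the slide does not spell this out).

```python
def fineness_degree(groups, t_median, t_shared_median):
    """Return eta_f = max_i (d_i * r_i) over the waiting groups.

    groups: list of groups, each a non-empty list of task queuing times q_j (seconds).
    t_median: median task duration, shared input transfer included (seconds).
    t_shared_median: median transfer time of the shared input data (seconds).
    """
    degrees = []
    for queue_times in groups:
        n_i = len(queue_times)
        # Estimated group duration: the shared data is transferred only once.
        duration = t_shared_median + n_i * (t_median - t_shared_median)
        if duration <= 0:
            raise ValueError("estimated duration must be positive")
        d_i = t_shared_median / duration      # share of time spent on shared data
        q_max = max(queue_times)
        r_i = q_max / (q_max + duration)      # queuing / round-trip time ratio
        degrees.append(d_i * r_i)
    return max(degrees)
```

With hypothetical values of 200 s for the median duration, 50 s for the shared transfer and a single task queued for 100 s, d = 0.25 and r = 1/3, so the degree is about 0.083.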

Page 12: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Fineness control: task estimation

  Estimation of task durations
    Job phases: setup, inputs download, execution, outputs upload
    Assumption: bag of tasks (all jobs have equal durations)
    Median-based estimation:

    Phase             Median duration   Real job duration   Estimated job duration
    Setup             50s               42s (completed)     42s
    Inputs download   250s              300s (completed)    300s
    Execution         400s              20s (current)       400s*
    Outputs upload    15s               ?                   15s
    Total             $\tilde{t}$ = 715s                    $\tilde{t}_i$ = 757s

    *: max(400s, 20s) = 400s
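The median-based estimation on this slide can be sketched as follows. The function and variable names are hypothetical; the rule applied is the one readable from the example: completed phases keep their real duration, the running phase counts as the larger of its median and its elapsed time, and phases not yet started count as their median.

```python
PHASES = ("setup", "inputs download", "execution", "outputs upload")


def estimate_duration(medians, completed, current_phase=None, elapsed=0.0):
    """Estimate the total duration of a job from partial observations."""
    total = 0.0
    for phase in PHASES:
        if phase in completed:
            total += completed[phase]                # real, measured duration
        elif phase == current_phase:
            total += max(medians[phase], elapsed)    # at least the elapsed time
        else:
            total += medians[phase]                  # not started: use the median
    return total


# Values from the slide example.
medians = {"setup": 50, "inputs download": 250, "execution": 400, "outputs upload": 15}
estimate = estimate_duration(
    medians,
    completed={"setup": 42, "inputs download": 300},
    current_phase="execution",
    elapsed=20,
)
print(estimate)  # 757.0
```

The sum of the medians alone gives 715 s, while the partially observed job is estimated at 757 s, matching the two totals on the slide.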

Page 13: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Fineness control: levels and actions

  Levels: identified from the platform logs
    Level 1 (no actions): $\eta_f \le \tau_f$
    Level 2 (action: task grouping): $\eta_f > \tau_f$
  Actions
    Task grouping
      Tasks are grouped pairwise until $\eta_f \le \tau_f$ or the number of waiting groups Q is smaller than or equal to the number of running groups R
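A possible shape for the pairwise grouping loop is sketched below. The stopping conditions follow the slide (stop once the fineness degree is no longer above the threshold, or once Q ≤ R). Everything else is an assumption made for illustration: the function name, the choice of merging the two groups at the head of the queue, and the `fineness_degree` callable passed in by the caller.

```python
from collections import deque


def group_pairwise(waiting_groups, running_count, tau_f, fineness_degree):
    """Merge waiting groups two at a time while tasks are too fine and Q > R.

    waiting_groups: list of groups, each a list of tasks.
    running_count: number of running groups R.
    tau_f: fineness threshold.
    fineness_degree: callable taking the list of groups and returning eta_f.
    """
    groups = deque(list(g) for g in waiting_groups)
    while (
        len(groups) > 1
        and len(groups) > running_count              # Q > R
        and fineness_degree(list(groups)) > tau_f    # tasks still too fine
    ):
        first = groups.popleft()
        second = groups.popleft()
        groups.append(first + second)                # merged group joins the tail
    return list(groups)
```

Appending the merged group at the tail means that singletons are paired first, which keeps group sizes balanced in this sketch.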

Page 14: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Coarseness control

  Non-stationary load
    Loss of parallelism
    Task de-grouping
  Levels
    $\tau_c = 0.5$
  Incident degree
    $\eta_c = \frac{R}{Q + R}$
  De-group tasks when R > Q

[Figure: timelines (axes: time, resources R1 to R3) of tasks t1 to t5, ungrouped and grouped as t2+t3 and t4+t5, illustrating the loss of parallelism]
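The coarseness test reduces to a one-line ratio. Since R / (Q + R) > 0.5 is equivalent to 2R > Q + R, that is R > Q, the threshold of 0.5 and the rule "de-group when R > Q" express the same condition. The sketch below uses hypothetical function names and assumes an empty system (Q + R = 0) has a degree of 0.

```python
def coarseness_degree(running, queued):
    """Return eta_c = R / (Q + R); 0.0 when nothing is running or queued."""
    total = running + queued
    return running / total if total else 0.0


def should_degroup(running, queued, tau_c=0.5):
    """De-group when eta_c exceeds tau_c, i.e. when R > Q for tau_c = 0.5."""
    return coarseness_degree(running, queued) > tau_c
```

For example, 3 running groups and 1 queued group give a degree of 0.75, so de-grouping is triggered; 2 running and 2 queued give exactly 0.5, so it is not.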

Page 15: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Workload for Case Studies

  Based on the workload of VIP
    January 2011 to April 2012
  Case studies on:
    Pilot jobs
    User accounting
    Task analysis
    Bag of tasks
    Workflows

  112 users
  2,941 workflow executions
  680,988 tasks
    338,989 completed
    138,480 error
    105,488 aborted
    15,576 aborted replicas
    48,293 stalled
    34,162 queued
  339,545 pilot jobs

R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.


Page 17: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Experiment Conditions

  Experiment 1
    Evaluate the fineness control process under stationary load
  Experiment 2
    Evaluate the de-grouping control process under non-stationary load
  Workflow characteristics

Page 18: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Results: stationary load

Fineness yields a significant makespan reduction for all repetitions

Page 19: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Results: stationary load (2)

Task grouping speeds up SimuBloch and FIELD-II by up to a factor of 2.6, and PET-SORTEO/emission by up to a factor of 2.5

Not all SimuBloch tasks can be grouped into a single group because 2 tasks must be completed for the task estimation process

Page 20: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Results: non-stationary load

[Figure panels: "Resources appear progressively" and "Resources appear suddenly"]

Executions are sped up by up to a factor of 1.5 for Fineness, and 2.1 for Fineness-Coarseness

Fineness is penalized by its lack of adaptation: slowdown of 20%

Page 21: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Results: non-stationary load (2)

The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates a strong positive correlation


Page 23: On-line, non-clairvoyant optimization of workflow activity granularity task on grids

Concluding remarks

  Context
    Autonomous handling of unfairness among workflow executions
    No strong assumptions on resource characteristics and workload
  Summary of the proposed method
    Implements a generic MAPE-K loop
    Determines task fineness based on queue waiting time and estimated data transfer time of shared input data
    Tasks are grouped pairwise as long as Q > R and tasks are too fine
    Tasks are ungrouped when the number of available resources increases
  Optimizing task granularity
    Properly detects and handles lightweight tasks
    Stationary load: fineness control significantly reduces the makespan of all applications
    Non-stationary load: de-grouping algorithm compensates for the lack of adaptation of task grouping

Page 24: On-line, non-clairvoyant optimization of workflow activity granularity task on grids


Thank you for your attention. Questions?


Acknowledgments:

  VIP users and project members
  French National Agency for Research (ANR-09-COSI-03, ANR-11-LABX-0063)
  EC FP7 Programme (312579 ER-flow)
  European Grid Initiative (EGI)
  France-Grilles