Estimating the Performance of BLAST runs on the EGEE Grid

Page 1: Estimating the Performance of BLAST runs on the EGEE Grid

EGEE-III INFSO-RI-222667

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

Abel Carrión

Ignacio Blanquer

Vicente Hernández

Estimating the Performance of BLAST runs on the EGEE Grid

Page 2: Estimating the Performance of BLAST runs on the EGEE Grid

Outline

• The problem.
• Factors affecting the performance.
• Experiments on the grid.
• The performance.
• Table of performance per node.
• Execution model.
• Conclusions and further work.

Uppsala – User Forum 12/4/10 2

Page 3: Estimating the Performance of BLAST runs on the EGEE Grid

Introduction: The Problem

• Sequence alignment is a key operation in Bioinformatics
– It involves comparing proteomic and genomic samples against annotated databases.

• It is part of many Bioinformatics pipelines
– Used to search for homologues when studying the function of different genes and regions.
– Used in phylogenetic taxonomy.

• Many tools have been described in the literature
– Based on the Smith-Waterman algorithm (e.g. BLAST, which approximates it heuristically).
– Based on hash tables (e.g. SSAHA, BLAT).
– Based on the Burrows-Wheeler transform (e.g. BWA, Bowtie).
– Combinations of these (e.g. SSAHA2).

Page 4: Estimating the Performance of BLAST runs on the EGEE Grid

Introduction: The Tool

• BLAST (Basic Local Alignment Search Tool) is the most widely used tool for aligning novel sequences of any length against those contained in a given database.
– Although it can be inefficient in many cases, it has a proven reputation.

• Because a typical use case entails aligning millions of sequences, these experiments are very computationally intensive (they can demand years of CPU time).

Page 5: Estimating the Performance of BLAST runs on the EGEE Grid

Introduction: The Approach

• The problem is massively parallel and meets the requirements for using Grid infrastructures efficiently.

• The parallelization follows the High-Throughput paradigm:
– The input data file is segmented into several chunks, which are aligned on independent computing nodes.
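The segmentation step can be sketched as follows. This is a minimal illustration, not the authors' tool: it assumes the input is a FASTA file and that chunks are built from a fixed number of records; the function names and the `records_per_chunk` parameter are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header line plus its sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def split_into_chunks(lines: Iterable[str], records_per_chunk: int) -> List[str]:
    """Group consecutive records into chunks of at most records_per_chunk."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be positive")
    chunks: List[str] = []
    current: List[str] = []
    for record in read_fasta_records(lines):
        current.append(record)
        if len(current) == records_per_chunk:
            chunks.append("".join(current))
            current = []
    if current:  # last, possibly smaller, chunk
        chunks.append("".join(current))
    return chunks


# Hypothetical three-sequence input, split two records per chunk.
sample = """>seq1
MKTAYIAK
>seq2
GAVLIPFW
QRS
>seq3
MSTNPKPQ
""".splitlines(keepends=True)

chunks = split_into_chunks(sample, records_per_chunk=2)
print(len(chunks), "chunks")
```

Each chunk is a valid FASTA fragment, so it can be submitted as the input of an independent job. Splitting by record count is only one possible choice; splitting by total residue count would give chunks of more even cost.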

Page 6: Estimating the Performance of BLAST runs on the EGEE Grid

Introduction: The Issues

• Two factors are key to overall performance
– A good selection of the computing and storage resources.
– A good partitioning strategy.

• These affect not only the response time but also the failure ratio
– Queues limit the maximum execution time of a job.
– Automatic fault-tolerant resubmission also needs to know whether a job is running slowly or is simply blocked.

• Thus, a key issue when scheduling thousands of jobs is estimating the response time of the tasks.
– Goal: create a model to estimate the execution time of a job.
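How an execution-time estimate feeds both decisions can be sketched as below. The thresholds (`margin`, `slow_factor`, `blocked_factor`) and the function names are illustrative assumptions, not values from the study.

```python
def fits_queue(estimated_s: float, queue_limit_s: float, margin: float = 0.8) -> bool:
    """True if the estimate uses at most `margin` of the queue's time limit."""
    return estimated_s <= margin * queue_limit_s


def classify_job(elapsed_s: float, estimated_s: float,
                 slow_factor: float = 1.5, blocked_factor: float = 3.0) -> str:
    """Compare elapsed time with the estimate to decide on resubmission.

    Returns 'on_track', 'slow' or 'presumed_blocked'. The two factors are
    hypothetical tuning knobs.
    """
    if estimated_s <= 0:
        raise ValueError("estimated_s must be positive")
    ratio = elapsed_s / estimated_s
    if ratio <= slow_factor:
        return "on_track"
    if ratio <= blocked_factor:
        return "slow"
    return "presumed_blocked"


# A job estimated at 2000 s that has been running for 7000 s.
print(classify_job(elapsed_s=7000, estimated_s=2000))
# A 3000 s estimate against a one-hour queue limit with a 20% safety margin.
print(fits_queue(estimated_s=3000, queue_limit_s=3600))
```

Without an estimate, neither check is possible: the scheduler cannot tell a long job from a hung one, nor choose a chunk size that stays within the queue limit.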

Page 7: Estimating the Performance of BLAST runs on the EGEE Grid

Factors affecting the performance

• Quasi-deterministic factors (obtained through experiments)
– Application dependent: input data file size, DB file size, similarity, number of hits.
– Resource dependent: SPECint (SI00), SPECfp (SF00), memory.

• Non-deterministic factors (not yet covered in the study)
– Load dependent: queue size, average waiting time.
– Site dependent: site availability, specific job failure rates.

Page 8: Estimating the Performance of BLAST runs on the EGEE Grid

Influence of the data size

• The purpose of this experiment was to analyse the influence of the data size (input file and database).

• Using the UniProt database, several files of different sizes were generated and run on the same machine.

• The results show that both the input file size and the database size have a direct, linear impact on the response time.
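A linear dependence of this kind can be checked with an ordinary least-squares line fit. The sketch below uses made-up timings purely for illustration; they are not measurements from the experiment, and `fit_line` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: relative input sizes and response times in seconds.
sizes = [1.0, 2.0, 4.0, 8.0]
times = [1100.0, 2100.0, 4100.0, 8100.0]

slope, intercept = fit_line(sizes, times)
print(f"time ~ {slope:.1f} * size + {intercept:.1f}")
```

A slope that stays stable when more sizes are added, together with small residuals, is what "direct linear impact" means in practice; the intercept captures the fixed start-up cost of a run.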

Page 9: Estimating the Performance of BLAST runs on the EGEE Grid

Influence of the similarity

• The heuristic nature of BLAST accelerates the comparison of two clearly unrelated sequences.

• To check the influence of similarity on the searches, three new versions of the UniProt database were produced, replacing 1%, 5% and 10% of its contents respectively.

• The results show that the response time is independent of the similarity between sequences.
– This factor will not be considered in the final performance model.

Page 10: Estimating the Performance of BLAST runs on the EGEE Grid

Influence of the parameter bhits

• An experiment was run on different computers for different values of the “BLAST Hits” argument
– Values range from Xxxx to XXX.
– Although the number of results in the output increases accordingly, no effect is observed on the response time or the failure rate.

Page 11: Estimating the Performance of BLAST runs on the EGEE Grid

Influence of Resource Performance

• The GLUE Schema includes the values of the SpecInt 2000 and SpecFloat 2000 benchmarks.

• Nine different job types were executed on 19 sites where this information was published.

[Figure: one series per job type, T1 to T9; vertical axis from 0 to 90000, horizontal axis from 0 to 4000.]

Page 12: Estimating the Performance of BLAST runs on the EGEE Grid

Spec Benchmark

• The SpecInt and SpecFloat benchmarks do not seem to be correlated with the actual performance of BLAST
– The absolute correlation coefficient ranges from 0.08 to 0.55, with poor significance (p-values up to 0.77).

• This might be caused by different factors
– Unsuitability of the benchmark for BLAST.
– Lack of accuracy of the benchmark, which at many sites is not measured but taken from tables.

• Moreover, the benchmark values are not always published.

• A new estimator is needed.

         T1      T2      T3      T4      T5      T6      T7      T8      T9
Pearson  -0.433  -0.309  -0.533  -0.42   -0.521  -0.549  -0.413  0.079   0.55
p        0.052   0.175   0.014   0.06    0.017   0.011   0.081   0.772   0.089
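For reference, the Pearson coefficient reported in the table can be computed as in the sketch below. The benchmark values and run times are invented for illustration and do not come from the study; the p-value is omitted because it needs a t-distribution, which the Python standard library does not provide.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical published benchmark values and measured run times (seconds)
# for the same job type on five sites.
benchmark = [400.0, 1000.0, 2000.0, 2100.0, 3500.0]
run_time = [5200.0, 3100.0, 4800.0, 2900.0, 3600.0]

r = pearson(benchmark, run_time)
print(f"r = {r:.3f}")
```

A benchmark that predicted BLAST run time well would give a coefficient close to -1 (a higher score meaning a shorter run); values near zero, as for T8, mean the published figure carries almost no information about the run time.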

Page 13: Estimating the Performance of BLAST runs on the EGEE Grid

Empirical estimator

• Ratio of average performance with respect to a fixed node
– The same tests are repeated on all nodes for different data sizes, and the speed-ups or slow-downs are computed.
– A single coefficient per node is obtained from the sum of the computing times of all jobs (the same number and type of jobs on every node).

• As expected, the correlation is much stronger (significance always above 98%, and generally above 99.99%).

         T1        T2        T3        T4        T5        T6        T7        T8     T9
Pearson  0.888     0.788     0.957     0.897     0.978     0.96      0.889     0.691  0.73
p        1.86E-07  3.33E-05  6.65E-11  8.72E-08  2.10E-13  3.12E-11  8.33E-07  0.004  0.015
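The empirical coefficient can be sketched as a ratio of summed run times. The direction of the ratio (reference time divided by site time, so that faster sites score above 1) is an assumption, as are the site names and timings, which are invented for illustration.

```python
from typing import Dict, List


def speed_factors(times_by_site: Dict[str, List[float]],
                  reference_site: str) -> Dict[str, float]:
    """One coefficient per site: summed reference times / summed site times.

    Every site must have run the same number (and type) of jobs,
    otherwise the sums are not comparable.
    """
    reference_times = times_by_site[reference_site]
    reference_total = sum(reference_times)
    factors: Dict[str, float] = {}
    for site, times in times_by_site.items():
        if len(times) != len(reference_times):
            raise ValueError(f"site {site!r} ran a different number of jobs")
        factors[site] = reference_total / sum(times)
    return factors


# Hypothetical run times (seconds) of the same three test jobs on each site.
measured = {
    "reference": [100.0, 400.0, 900.0],
    "site_a": [50.0, 200.0, 450.0],
    "site_b": [200.0, 800.0, 1800.0],
}

factors = speed_factors(measured, reference_site="reference")
for site, factor in sorted(factors.items()):
    print(site, round(factor, 2))
```

Summing over all jobs before dividing, rather than averaging per-job ratios, weights long jobs more heavily, which suits an estimator intended for long-running grid tasks.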

Page 14: Estimating the Performance of BLAST runs on the EGEE Grid

Empirical estimator

[Figure: series labelled 1, 4, 9, 16, 25, 36, 49, 64 and 81; vertical axis from 0 to 90000, horizontal axis from 0.7 to 2.3.]

Page 15: Estimating the Performance of BLAST runs on the EGEE Grid

Estimating the model

• Current conclusions
– Direct (linear) dependence on the input data size.
– Direct (linear) dependence on the database size.
– No dependence on the similarity or the number of hits.
– Direct dependence on the process speed factor.
– Low dependence on the memory, except when it saturates.

• Proposed model

$$T = f\left(Sp,\ \frac{T_{inp}}{T_{inp\_basal}},\ \frac{T_{BD}}{T_{BD\_basal}};\ P_{Sp},\ P_{Tm},\ K\right)$$

[Equation: the slide gives T in terms of the symbols above; the operators joining them are not legible in this transcript.]

• Where
– P_Sp, P_Tm and K are the unknown parameters of the regression model.
– T_inp and T_BD are the fixed values for the size of the input data and of the database.
– Sp is the ratio between the response time at a reference site and at each other site.
– T_inp_basal is 0.5 and T_BD_basal is 50.

Page 16: Estimating the Performance of BLAST runs on the EGEE Grid

Real execution time per node

[Figure: vertical axis from 0 to 90000, horizontal axis from 0 to 90; one series per node, labelled 381, 390, 711, 838.5, 976, 1031, 1425, 1931, 2000, 2000, 2002, 2010.5, 2021, 2100, 2131, 2615.5, 2700, 3500 and 3587.]

Page 17: Estimating the Performance of BLAST runs on the EGEE Grid

Different model for each node

[Figure: vertical axis from 0 to 100000, horizontal axis from 0 to 60; Series1 to Series10.]

Page 18: Estimating the Performance of BLAST runs on the EGEE Grid

Single model with the speed adjustment

[Figure: vertical axis from 0 to 100000, horizontal axis from 0 to 60; Series1 to Series11.]

[Equation: the fitted model, giving T in terms of Sp and the ratios T_inp/T_inp_basal and T_BD/T_BD_basal, with fitted coefficients 1.35, 835.62 and 1065.5; the operators joining the terms are not legible in this transcript.]

Page 19: Estimating the Performance of BLAST runs on the EGEE Grid

Conclusions and further work

• This work presents a performance model for estimating the response time of BLAST runs on the EGEE grid
– Direct dependence on the performance indicator and on the sizes of the input data and the reference database.
– No direct dependence on memory or BLAST hits.

• The estimate is a very important parameter for load balancing and pre-emptive resubmission.

• The work will be extended by introducing new parameters from other components
– Workload Management System.
– Local Resource Management Systems.
– Bandwidth between CEs and SEs.
