deployment and preparation of metagenomic analysis on the eela grid gabriel aparício, et al

26
Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Upload: laura-turlington

Post on 01-Apr-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Deployment and Preparation of Metagenomic Analysis on the EELA

GridGabriel Aparício, et al.

Page 2: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Topic summary

Global Process Topics

• Introduction• Cases of study• Deployment• Automation System• Results and Performances Analysis• Future Plans

Page 3: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Introduction

• What is a Metagenome?• What is a Metagenome Analysis?• Why Grid is a Good Solution?• Which is the Proposed Structure?• Which are the Future Plans?

Page 4: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

What is a Metagenome?

Introduction

• Term first used by Jo Handelsman and others in the University of Wisconsin in 1998.

• A metagenome is a collection of genes.

• It can be studied as a single gene.• Analysis can be done without

isolating genes and lab-cultivating them.

Page 5: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

What is a Metagenome Analysis?

Introduction

• A Metagenome Analysis is the group of necessary steps to transform a file of a coded metagenome into another file with some interest information.

• This can include:– Database filtering.– BLAST alignments.– BLAST output filtering.– Creation of Phylogenetic Trees.

Page 6: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Why Grid is a Good Solution?

Introduction

• A Metagenome can be coded into several hundred of thousand sequences.

• Sequential time can take more than a year.• Public databases are continuously changing.• Several coarse steps can be done in parallel.• In a Grid, the global job can be divided into

subjobs.• A Metagenome Analysis can be processed in

a few days with a Grid Infrastructure.

Page 7: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Farm Soil Metagenome

Cases of Study

• This is a sample from a nutrient rich, moderately contaminated soil environment.

• This community is very diverse and complex.

• Many yet unknown enzymes are probably present there.

• Its study is very interesting from the biotechnological point of view.

Page 8: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Whale fall Metagenome

Cases of Study

• Whale carcasses are known to be a nutrient-rich environment in the bottom of the ocean.

• A heterogeneous mixture of bacteria flourish there.

• It is one of the best examples of marine bacterial communities.

Page 9: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Sargasso Sea Metagenome

Cases of Study

• These oceanic samples are taken from surface waters.

• They represent the diversity of bacteria that live planktonically.

Page 10: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Gut Metagenomes

Cases of Study

• Several metagenomes of the human intestinal microbiota.

• This consortia of bacteria helps its host to metabolize many nutrients that would be indigestible otherwise.

• It is involved in other functions–Maturation and modulation of the

immune response of the host.– Prevention of infection by bacterial

pathogens.

Page 11: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Sequential or Parallel jobs? (I)

Deployment

• There are around 150 CEs in BIOMED and EELA VOs.

• There are only around 30 CEs able to run MPICH jobs.

• The number of CEs decreases when the number of required nodes increases.

• Full efficiency in MPICH jobs is achieved occasionally.

Page 12: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Sequential or Parallel jobs? (II)

Deployment

Page 13: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Selecting CEs (I)

Deployment

• Several jobs are needed– A single job can take more than a year.– It is needed to split the Analysis into

several subjobs (often more than 100 subjobs).

• Several CEs are needed– To decentralize processing, storing and

network bandwidth.

• A Metagenome Analysis job has requirements– On software, hardware and configuration.

Page 14: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Selecting CEs (II)

Deployment

• Not all available CEs are able to produce results.

• Not all available CEs have the same performance.

• It is needed to select CEs and to distribute jobs according to their performance.

Page 15: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Selecting CEs (III)

Deployment

Page 16: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Selecting SEs and Replicating Files

Deployment

• All jobs need certain common files.• These files have to be replicated to

increase performance and to distribute network bandwidth.

• SEs selected will be located according to their geographical and administrative nearness to selected CEs, their performance and their configuration.

Page 17: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Splitting global job

Deployment

• The global job has to be broken down into subjobs.

• The subjob lifetime will decrease– Increase interactivity.– Improve monitoring capabilities.

Page 18: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Submitting Jobs

Automation System

• Subjobs are assigned to a list of CEs• These CEs have been tested.• Assignation is done according to

obtained performances in previous experiments.

Page 19: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Monitoring Jobs

Automation System

• Periodically, jobs status are monitored.

• In case of errors (aborted job, bad results, etc.), the job is automatically resubmitted.

• In case the job is running too long, the job is cancelled and resubmitted.

• In case the job has finished successfully, its CEs is annotated for later submissions.

Page 20: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Resubmitting Jobs

Automation System

• Each correctly finished job annotates its CEs and puts it into a list.

• The jobs are resubmitted to a random CE of this list.

• If the list does not exist, the job is submitted to a random CE.

Page 21: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Retrieving Results

Automation System

• Once results are available, they are downloaded and the standard outputs are explored to find any error.

• A retrieved job is no longer monitored.

Page 22: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

First conclusions

Results and Performances

• Jobs are too long to run sequentially– Sargasso Sea Metagenome takes 512 days.

• The same job in Grid takes 13 days to be fully finished.– Speedup is around 40.

• High speed for most jobs (90% in 7 days)– Speedup is around 80.– No need to finish all jobs to begin with new

stages.

Page 23: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Correctly finished jobs percentage

Results and Performances

Page 24: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Sequences processed per hour

Results and Performances

Page 25: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Future plans

Future plans

• To create several shell-scripts with different stages depending on the desired results.

• To increase cases of study.• To improve automation

performances.• To make a report with the issues and

lessons learnt in EGEE and EELA infrastructures.

Page 26: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al

Contact

Contact

Gabriel Aparício i PlaIgnacio Blanquer Espert

Vicente Hernández García

Universitat Politècnica de ValènciaCamí de Vera, s/n

46022 València, SpainEmails: [email protected]

[email protected]@itaca.upv.es