
Proteus, a Grid based Problem Solving Environment (PSE) for Bioinformatics: Architecture and Experiments

Authors: Mario Cannataro1, Carmela Comito2, Filippo Lo Schiavo1, and Pierangelo Veltri1 (February 2004)
1 University of Magna Graecia of Catanzaro, Italy
2 University of Calabria, Italy

Presenter: Michael Robinson
Agnostic: Javier Munoz

Advanced Topics in Software Engineering CIS 6612
Florida International University
July 31, 2006


Organization

Abstract
~60% is about Bioinformatics
Proteus Architecture
First Test Implementation
Results of First Test
Conclusion and Future Work


Abstract

Life Sciences, Bioinformatics, Computer Science

Data file sizes

Computer power


The Partners

What is Life Sciences

What is Bioinformatics
Other Sciences used in Bioinformatics

What is Computer Science


Human Genome

The sum total of DNA in an organism is its genome.

The Human Genome Project (HGP), an international effort, began in October 1990; its completion has been variously dated to 1999, 2003, and 2004. (http://www.pbs.org/wgbh/nova/genome/program.html)

Project goals were to:
Determine the complete sequence of the 3 billion DNA bases
Identify all human genes
Make them accessible for further biological study


Human Genome

The bacterium E. coli and others were used to help develop the technology and interpret human gene function.

The Human Genome Project was sponsored by:

The U.S. Department of Energy and The U.S. National Institutes of Health

http://www.preventiongenetics.com/edu/genetics_nutshell.htm


DNA (ACGT)

Humans have from 10 to 100 trillion cells

Each human cell has about 3 billion nucleotides

We have approximately 30,000 genes

Of the three billion letters of DNA that we have, only 1 to 1.5 percent is gene; the rest is "STUFF".

The functions are unknown for over 50% of known genes


DNA (ACGT)

Human Genome

~3,000,000,000 DNA bases
~30,000,000 bases in genes
~2,970,000,000 stuff

Adenine (A) forms a base pair with thymine (T)
Guanine (G) forms a base pair with cytosine (C)
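The pairing rule above (A with T, G with C) is all that is needed to derive one DNA strand from the other. A minimal sketch in Python, where the function names are illustrative and not taken from the presentation:

```python
# Watson-Crick pairing as stated on the slide: A<->T and G<->C.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base complement of an uppercase DNA strand."""
    return strand.translate(PAIRS)

def reverse_complement(strand: str) -> str:
    """Return the opposite strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```

The reverse complement is the form usually needed in practice, because the two strands of the double helix run in opposite directions.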


Similarities to Human DNA

Another human? 99.9% - All humans have the same genes, but some of these genes contain sequence differences that make each person unique.

A chimpanzee? 98.5% - Chimpanzees are the closest living species to humans.

A mouse? 92.0% - All mammals are quite similar genetically.

A fruit fly? 44.0% - Studies of fruit flies have shown how shared genes govern the growth and structure of both insects and mammals.

Yeast? 26.0% - Yeasts are single-celled organisms, but they have many housekeeping genes that are the same as the genes in humans, such as those that enable energy to be derived from the breakdown of sugars.

A weed (thale cress)? 18.0% - Plants have many metabolic differences from humans. For example, they use sunlight to convert carbon dioxide gas to sugars. But they also have similarities in their housekeeping genes.


The gene sizes

The largest known human gene is dystrophin, at 2.4 million bases.

Chromosome 21 is the smallest human chromosome. Three copies of this autosome cause Down syndrome, the most frequent genetic disorder associated with significant mental retardation.

Academic groups from Germany and Japan mapped and sequenced it; it has 33,546,361 bp of DNA.

Analysis of the chromosome revealed: 127 known genes, 98 predicted genes, and 59 pseudogenes.

Smallest bacterial genome: Mycoplasma genitalium, with a size of 580 kbp.


Bioinformatics

DNA RNA PROTEINS

MUTATIONS, ILLNESSES

MEDICATIONS

CLONING


DNA (ACGT)

Pseudomonas aeruginosa PAO1: 6,264,403 bases, 5565 genes

complement(6264226..6264360)
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg


RNA

In RNA, thymine is replaced by uracil (U).

DNA
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg

RNA
6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
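The RNA listing is the DNA listing with every t replaced by u. A minimal Python sketch of that substitution (the function name is illustrative):

```python
def transcribe(dna: str) -> str:
    """Replace thymine with uracil, preserving the case of the input."""
    return dna.replace("t", "u").replace("T", "U")

# First two blocks of the DNA listing on the slide.
print(transcribe("gcttgtcccg gtcgaagtcc"))  # gcuugucccg gucgaagucc
```

This is the letter-for-letter convention used on the slide; it does not model which strand the cell actually transcribes.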


Amino Acids

UUU F phe Phenylalanine        UCU S ser Serine      UAU Y tyr Tyrosine        UGU C cys Cysteine
UUC F phe Phenylalanine        UCC S ser Serine      UAC Y tyr Tyrosine        UGC C cys Cysteine
UUA L leu Leucine              UCA S ser Serine      UAA Stop                  UGA Stop
UUG L leu Leucine              UCG S ser Serine      UAG Stop                  UGG W trp Tryptophan
CUU L leu Leucine              CCU P pro Proline     CAU H his Histidine       CGU R arg Arginine
CUC L leu Leucine              CCC P pro Proline     CAC H his Histidine       CGC R arg Arginine
CUA L leu Leucine              CCA P pro Proline     CAA Q gln Glutamine       CGA R arg Arginine
CUG L leu Leucine              CCG P pro Proline     CAG Q gln Glutamine       CGG R arg Arginine
AUU I ile Isoleucine           ACU T thr Threonine   AAU N asn Asparagine      AGU S ser Serine
AUC I ile Isoleucine           ACC T thr Threonine   AAC N asn Asparagine      AGC S ser Serine
AUA I ile Isoleucine           ACA T thr Threonine   AAA K lys Lysine          AGA R arg Arginine
AUG M met Methionine (Start)   ACG T thr Threonine   AAG K lys Lysine          AGG R arg Arginine
GUU V val Valine               GCU A ala Alanine     GAU D asp Aspartic acid   GGU G gly Glycine
GUC V val Valine               GCC A ala Alanine     GAC D asp Aspartic acid   GGC G gly Glycine
GUA V val Valine               GCA A ala Alanine     GAA E glu Glutamic acid   GGA G gly Glycine
GUG V val Valine               GCG A ala Alanine     GAG E glu Glutamic acid   GGG G gly Glycine


Proteins (sequences)

DNA
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg

RNA
6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg

PROTEIN MKRTFQPSTLKRARVHGFRARMATKNGRQVLSRRRAKGRKRLTV
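The protein string can be reproduced from the DNA listing. The annotation complement(6264226..6264360) shown with the sequence indicates that the gene lies on the opposite strand, so the sketch below cuts out that range, takes its reverse complement, and translates it with the standard genetic code. It is a minimal illustration: the helper names are hypothetical, and the codon table is written in the DNA alphabet (t instead of u) so that no separate transcription step is needed.

```python
from itertools import product

# 180 bases starting at genome position 6264181, copied from the slide.
SEQ_START = 6264181
DNA = "".join("""
gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
""".split())

# Standard genetic code, codons enumerated in t, c, a, g order; '*' marks Stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("tcag", repeat=3), AMINO)}

PAIRS = str.maketrans("acgt", "tgca")

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(PAIRS)[::-1]

def feature(start: int, end: int) -> str:
    """Slice the listing using 1-based, inclusive genome coordinates."""
    return DNA[start - SEQ_START : end - SEQ_START + 1]

def translate(dna: str) -> str:
    """Translate codon by codon, stopping at the first Stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)

gene = reverse_complement(feature(6264226, 6264360))
protein = translate(gene)
print(protein)  # MKRTFQPSTLKRARVHGFRARMATKNGRQVLSRRRAKGRKRLTV
```

The extracted range is 135 bases long: 44 codons for the amino acids shown on the slide plus one Stop codon.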


Proteins: Pattern Matching

G-H-E-X(2)-G-X(4,5)-[GA]

GHEGVGKVVKLGAGA
GHEKKGYF-DRGPSA
GHEGYGGRSRGGGYS
GHEFEGPK-CGALYI
GHELRGTTFMPALEC
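The pattern above uses PROSITE notation: elements are separated by hyphens, X stands for any residue, X(2) means exactly two of them, X(4,5) means four or five, and [GA] means G or A. A small sketch of how such a pattern could be turned into a regular expression follows. The converter is a hypothetical helper that handles only the constructs appearing on this slide plus the {..} exclusion form; it is not a full PROSITE parser.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into a Python regular expression."""
    pieces = []
    for element in pattern.split("-"):
        # Split an element such as "X(4,5)" into its core and repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core.upper() == "X":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        elif not core.startswith("["):
            core = re.escape(core)          # a literal residue
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        pieces.append(core)
    return "".join(pieces)

regex = re.compile(prosite_to_regex("G-H-E-X(2)-G-X(4,5)-[GA]"))
print(regex.pattern)  # GHE.{2}G.{4,5}[GA]

aligned = ["GHEGVGKVVKLGAGA", "GHEKKGYF-DRGPSA", "GHEGYGGRSRGGGYS",
           "GHEFEGPK-CGALYI", "GHELRGTTFMPALEC"]
for row in aligned:
    ungapped = row.replace("-", "")  # '-' in the alignment is a gap, not a residue
    print(row, bool(regex.search(ungapped)))
```

All five aligned sequences from the slide match once the gap characters are removed, which is the point of the variable-length X(4,5) element.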


Proteins: Structures

Chemical properties that distinguish the 20 different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell


Reality

Somewhere in this dense chemical forest are genes involved in deafness, Alzheimer's, cancer, cataracts, etc.

But where? This is such a maze that scientists need a map.

Out of three billion base pairs in our DNA, just one single letter can make a difference.


Data Locations

GenBank in the US, 1974 http://www.ncbi.nlm.nih.gov/
1997 = 1.26 gigabases
2004 = 39 gigabases
2005 = 100 gigabases

EMBL in England, 1980 http://www.ebi.ac.uk/embl/

DDBJ in Japan, 1984 http://www.ddbj.nig.ac.jp/
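The GenBank figures listed for 1997 and 2005 imply a steady compound growth rate, which can be estimated with a short calculation. This is a rough sketch that assumes smooth exponential growth between the two end points; the variable and function names are illustrative.

```python
import math

# GenBank sizes in gigabases, as listed on the slide.
genbank_gigabases = {1997: 1.26, 2004: 39.0, 2005: 100.0}

def annual_growth_factor(first_year: int, last_year: int) -> float:
    """Average yearly multiplier between two listed years."""
    ratio = genbank_gigabases[last_year] / genbank_gigabases[first_year]
    return ratio ** (1 / (last_year - first_year))

factor = annual_growth_factor(1997, 2005)
doubling_years = math.log(2) / math.log(factor)
print(f"{factor:.2f}x per year, doubling about every {doubling_years:.1f} years")
```

Under that assumption the database grew by roughly 1.7 times per year over the period, which is the kind of data growth that motivates Grid-scale computing for bioinformatics.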


Some Databases

The Swiss Institute of Bioinformatics maintains the following databases:

Ashbya Genome Database
Cancer Immunome Database
Eukaryotic Promoter Database (EPD)
GermOnline
MyHits
PROSITE
Swiss-Prot and TrEMBL
SWISS-2DPAGE
SWISS-MODEL Repository


Specialization

PlasmoDB http://www.plasmodb.org/plasmo/home.jsp

Parasitic eukaryote Plasmodium, the causative agent of the disease malaria.


Proteus General Architecture


Proteus’ Software Modules


Some Taxonomies of the Bioinformatics Ontology


Snapshot of the Ontology Browser


Human Protein Clustering Workflow


Snapshot of VEGA: Workspace 1 of the Data Selection Phase


Software Installed in the Example Grid

Software Components    Grid Nodes (Minos, k3, k4)
seqret                 *
splitfasta             *
blastall               * * *
cat                    * * *
Tribe-parse            * * *
Tribe-matrix           *
mcl                    *
Tribe-families         *


Snapshot of the Ontology Browser


Snapshot of the Ontology Browser


Snapshot of the Ontology Browser


Snapshot of VEGA: Workspace 1 of the Pre-processing Phase


Conclusions and Future Work

Execution Times of the Application

TribeMCL Application     30 Proteins   All Proteins
Data Selection           1’44”         1’41”
Pre-Processing           2’50”         8h50’13”
Clustering               1’40”         2h50’28”
Results Visualization    1’14”         1’42”
Total Execution Time     7’28”         11h50’53”
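The phase timings can be added up to see where the time goes. The sketch below stores each duration as an (hours, minutes, seconds) tuple copied from the table; the names are illustrative. The 30-protein column sums to the listed 7’28”. Summing the four all-proteins phases as listed gives 11h44’04”, slightly below the listed total of 11h50’53”; the figures are reproduced here as they appear on the slide.

```python
# (hours, minutes, seconds) per phase: (30 proteins, all proteins).
PHASES = {
    "Data Selection":        ((0, 1, 44), (0, 1, 41)),
    "Pre-Processing":        ((0, 2, 50), (8, 50, 13)),
    "Clustering":            ((0, 1, 40), (2, 50, 28)),
    "Results Visualization": ((0, 1, 14), (0, 1, 42)),
}

def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    return hours * 3600 + minutes * 60 + seconds

def format_duration(total: int) -> str:
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{seconds:02d}s"

total_small = sum(to_seconds(*small) for small, _ in PHASES.values())
total_all = sum(to_seconds(*full) for _, full in PHASES.values())
print(format_duration(total_small), format_duration(total_all))

# Share of the all-proteins run spent in each phase.
for name, (_, full) in PHASES.items():
    print(f"{name}: {to_seconds(*full) / total_all:.1%}")
```

For the full data set, pre-processing and clustering together account for nearly all of the run time, while data selection and visualization stay at under two minutes each regardless of input size.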


References

In the paper, the authors cite 27 references.


Questions

Thank you