Frank Dehne: Parallel Computational Biochemistry
Frank Dehne www.dehne.net
Proteins, DNA, etc.
• DNA encodes the information necessary to produce proteins
• Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes)
• Proteins are formed from a chain of molecules called amino acids
• The DNA sequence encodes the amino acid sequence that constitutes the protein
• There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...
Databases of Biological Sequences
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides
Swiss-Prot: 104,559 sequences; 38,460,707 residues
PDB: 17,175 structures
Sequence comparison
• Compare one sequence (target) to many sequences (database search)
• Compare more than two sequences simultaneously
Applications
• Phylogenetic analysis
• Identification of conserved motifs and domains
• Structure prediction
Structure Prediction
Genomic sequences
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein sequences
Protein structures
Our Contributions
• Parallel min vertex cover for improved sequence alignments (to appear in Journal of Computer and System Sciences)
• Parallel Clustal W (ICCSA 2003)
• In progress: “Clustal XP” portal at http://cgm.dehne.net
Progressive Alignment
Pairwise distance matrix:
Scerevisiae [1]
Celegans    [2]  0.640
Drosophila  [3]  0.634  0.327
Human       [4]  0.630  0.408  0.420
Mouse       [5]  0.619  0.405  0.469  0.289
[Figure: guide tree with leaves S. cerevisiae, C. elegans, Drosophila, Mouse, Human]
1. Do pairwise alignment of all sequences and calculate distance matrix
2. Create a guide tree based on this pairwise distance matrix
3. Align progressively following the guide tree
   • start by aligning the most closely related pairs of sequences
   • at each step, align two sequences or one sequence to an existing subalignment
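The first two steps can be illustrated with the distance matrix shown on this slide. The sketch below builds a guide tree by repeatedly merging the two closest clusters using average linkage (UPGMA-style). This is a simplified stand-in chosen for brevity, not the neighbor-joining procedure that Clustal W itself uses, and the name `build_guide_tree` is hypothetical.

```python
from itertools import combinations

# Pairwise distances from the slide (lower triangle of the matrix).
NAMES = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
LOWER = [
    [],
    [0.640],
    [0.634, 0.327],
    [0.630, 0.408, 0.420],
    [0.619, 0.405, 0.469, 0.289],
]

def build_guide_tree(names, lower):
    """Average-linkage clustering; returns (nested-tuple tree, list of merges)."""
    # Symmetric lookup of the distance between two original sequences.
    dist = {}
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            dist[frozenset((names[i], names[j]))] = d

    def leaves(tree):
        # A leaf is a plain string; an internal node is a pair of subtrees.
        return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

    def cluster_distance(a, b):
        # Mean distance over all leaf pairs drawn from the two clusters.
        pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: cluster_distance(*ab))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append((a, b))
        merges.append((a, b))
    return clusters[0], merges

tree, merges = build_guide_tree(NAMES, LOWER)
for step, (a, b) in enumerate(merges, 1):
    print(step, a, b)
```

The merge order is the order in which step 3 would align sequences and subalignments: the closest pair first, then progressively more distant groups.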
Parallel Clustal
• Parallel pairwise (PW) alignment matrix
• Parallel guide tree calculation
• Parallel progressive alignment
Clustal XP vs. SGI
SGI data taken from "Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL" by Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts
Parallel Clustal - Improvements
• Optimization of input parameters
  – scoring matrices, gap penalties: requires many repetitive Clustal W calculations with various input parameters
• Minimum Vertex Cover
  – use minimum vertex cover to remove erroneous sequences and identify clusters of highly similar sequences
Minimum Vertex Cover
Conflict Graph
– vertex: sequence
– edge: conflict (e.g. alignment with very poor score)
TASK: remove the smallest number of gene sequences that eliminates all conflicts
NP-complete
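One way to picture the conflict graph is as a thresholded view of pairwise alignment scores. The sketch below is illustrative only: the sequence names, the score values, and the rule "an edge exists when the pairwise score falls below a threshold" are all assumptions made for this example, not details taken from the talk.

```python
def build_conflict_graph(scores, threshold):
    """Return adjacency sets; an edge joins two sequences whose score is below threshold."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score < threshold:  # assumed conflict rule: very poor pairwise score
            graph[a].add(b)
            graph[b].add(a)
    return graph

def is_vertex_cover(graph, removed):
    """True if removing the sequences in `removed` eliminates every conflict edge."""
    return all(u in removed or v in removed for u in graph for v in graph[u])

# Hypothetical pairwise alignment scores (not data from the talk).
scores = {
    ("seqA", "seqB"): 42, ("seqA", "seqC"): 38, ("seqA", "seqD"): 4,
    ("seqB", "seqC"): 51, ("seqB", "seqD"): 7, ("seqC", "seqD"): 3,
}
graph = build_conflict_graph(scores, threshold=10)
print(graph)
print(is_vertex_cover(graph, {"seqD"}))
```

In this toy input every conflict involves `seqD`, so removing that single sequence eliminates all conflicts; the minimum vertex cover is the smallest such removal set.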
FPT Algorithms
• Phase 1: Kernelization
Reduce problem to size f(k)
• Phase 2: Bounded Tree Search
Exhaustive tree search; exponential in f(k)
Kernelization
Buss's Algorithm for k-vertex cover
• Let G=(V,E) and let S be the subset of vertices with degree greater than k (each such vertex must be in every k-vertex cover).
• Remove S and all incident edges: G -> G', k -> k' = k - |S|.
• IF G' has more than k x k' edges THEN no k-vertex cover exists
ELSE start bounded tree search on G'
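The kernelization step can be sketched as follows. This is a minimal illustration, assuming the usual form of the rule in which vertices of degree greater than k are forced into the cover; the graph representation (a dict of neighbour sets) and the name `buss_kernel` are hypothetical choices for this example.

```python
def buss_kernel(graph, k):
    """Buss-style kernelization for k-vertex cover.

    graph: dict mapping each vertex to the set of its neighbours (undirected).
    Returns (forced, kernel, k_prime), or None if no k-vertex cover can exist.
    """
    # S: vertices whose degree exceeds k; each must be in any cover of size <= k.
    forced = {v for v, nbrs in graph.items() if len(nbrs) > k}
    k_prime = k - len(forced)
    if k_prime < 0:
        return None
    # G': remove S and all incident edges, then drop isolated vertices.
    kernel = {v: nbrs - forced for v, nbrs in graph.items() if v not in forced}
    kernel = {v: nbrs for v, nbrs in kernel.items() if nbrs}
    edges = sum(len(nbrs) for nbrs in kernel.values()) // 2
    # Every vertex left in G' has degree <= k, so k' vertices cover at most k * k' edges.
    if edges > k * k_prime:
        return None
    return forced, kernel, k_prime

# Toy graph: a star centred on "c" with four leaves, plus a separate edge 5-6.
graph = {
    "c": {"1", "2", "3", "4"},
    "1": {"c"}, "2": {"c"}, "3": {"c"}, "4": {"c"},
    "5": {"6"}, "6": {"5"},
}
print(buss_kernel(graph, 2))
print(buss_kernel(graph, 1))
```

With k = 2 the centre is forced into the cover and a single edge remains for the tree search with k' = 1; with k = 1 the edge bound already rules out a cover.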
Bounded Tree Search
[Figure: search tree rooted at VC={}; each node below the root extends the partial vertex cover (VC+=...), with up to three children per node]
Case 1: simple path of length 3
In graph G': path v, v1, v2, v3
Search tree: node VC={...} has three children: VC+={v,v2}, VC+={v1,v2}, VC+={v1,v3}
Remove selected vertices from G'; k' -= 2
Case 2: 3-cycle
In graph G': cycle v, v1, v2
Search tree: node VC={...} has three children: VC+={v,v1}, VC+={v1,v2}, VC+={v,v2}
Remove selected vertices from G'; k' -= 2
Case 3: simple path of length 2
In graph G': path v, v1, v2
Search tree: node VC={...} has a single child: VC+={v1}
Remove v1, v2 from G'; k' -= 1
Case 4: simple path of length 1
In graph G': path v, v1
Search tree: node VC={...} has a single child: VC+={v}
Remove v, v1 from G'; k' -= 1
Sequential Tree Search
Depth-first search
– backtrack when k'=0 and G'<>{} ("dead end")
– stop when solution found (G'={}, k'>=0)
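A minimal sketch of the depth-first search with the backtracking and stopping rules listed above. For brevity it branches on a single edge (one of its two endpoints must be in the cover) rather than on the four path and cycle cases of the preceding slides, so it is a simpler stand-in and not the branching scheme of the talk; the name `vertex_cover_dfs` is hypothetical.

```python
def vertex_cover_dfs(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    edges: iterable of 2-tuples describing the undirected graph G'.
    """
    edges = [tuple(e) for e in edges]
    if not edges:
        return set()      # G' = {} and k' >= 0: solution found
    if k == 0:
        return None       # k' = 0 but edges remain: dead end, backtrack
    u, v = edges[0]
    for chosen in (u, v):  # one endpoint of this edge must be in the cover
        remaining = [e for e in edges if chosen not in e]
        sub = vertex_cover_dfs(remaining, k - 1)
        if sub is not None:
            return sub | {chosen}
    return None

# The path v, v1, v2, v3 from Case 1 has a cover of size 2 but none of size 1.
path = [("v", "v1"), ("v1", "v2"), ("v2", "v3")]
print(vertex_cover_dfs(path, 2))
print(vertex_cover_dfs(path, 1))
```

Edge branching gives a search tree of at most 2^k' leaves; the case analysis on the slides exists to obtain a smaller tree than this simple scheme.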
Parallel Tree Search
Basic Idea:
– Build the top log p levels of the search tree (T')
– Every processor starts a depth-first search at one leaf of T'
– Randomize the depth-first search by selecting a random child
[Figure: search tree whose top log p levels form T']
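The three steps above can be sketched in a single process. The example below is a simulation under stated assumptions: it visits the "processors" one after another instead of running them concurrently (the actual implementation described later uses MPI), it uses simple two-way edge branching rather than the case analysis of the earlier slides, and all function names are hypothetical.

```python
import math
import random

def branch(edges, cover):
    """Children of a search-tree node: put one endpoint of the first edge into the cover."""
    u, v = edges[0]
    return [([e for e in edges if c not in e], cover | {c}) for c in (u, v)]

def build_frontier(edges, p):
    """Expand the top log2(p) levels of the search tree (T') and return its leaves."""
    frontier = [(list(edges), frozenset())]
    for _ in range(int(math.log2(p))):
        next_level = []
        for sub_edges, cover in frontier:
            # A node with no edges left is already a solution; keep it as a leaf.
            next_level.extend(branch(sub_edges, cover) if sub_edges else [(sub_edges, cover)])
        frontier = next_level
    return frontier

def random_dfs(edges, cover, k, rng):
    """Depth-first search below one leaf of T', visiting children in random order."""
    if not edges:
        return cover if len(cover) <= k else None
    if len(cover) >= k:
        return None  # budget exhausted but edges remain: dead end
    children = branch(edges, cover)
    rng.shuffle(children)
    for sub_edges, sub_cover in children:
        found = random_dfs(sub_edges, sub_cover, k, rng)
        if found is not None:
            return found
    return None

def parallel_search(edges, k, p, seed=0):
    """Simulate p processors in turn; returns (rank, cover) or None."""
    rng = random.Random(seed)
    for rank, (sub_edges, cover) in enumerate(build_frontier(edges, p)):
        found = random_dfs(sub_edges, cover, k, rng)
        if found is not None:
            return rank, set(found)
    return None

cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(parallel_search(cycle5, k=3, p=4))
print(parallel_search(cycle5, k=2, p=4))
```

In a real parallel run every rank would search its own leaf of T' at the same time and the first rank to find a cover would signal the others to stop.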
Analysis: Balls-in-bins
Sequential depth-first search path: total length L, number of solutions m
Expected sequential time (random distribution): L/(m+1)
Parallel search path
Expected parallel time (random distribution): p + L/(p(m+1))
Expected speedup: p / (1 + (m+1)/L)
If m << L then expected speedup = p
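The sequential estimate L/(m+1) can be checked with a small Monte Carlo experiment. The model below is an assumption made for this sketch, not a description of the talk's analysis: m solutions sit at distinct, uniformly random positions on a path of length L, the sequential search scans from the start, and each of p processors scans its own contiguous block of L/p positions in lockstep. The name `simulate` and the parameter values are hypothetical.

```python
import random

def simulate(L, m, p, trials=2000, seed=1):
    """Monte Carlo estimate of the mean number of steps to the first solution.

    Assumes p divides L. Returns (sequential_mean, parallel_mean).
    """
    rng = random.Random(seed)
    block = L // p
    seq_total = par_total = 0
    for _ in range(trials):
        solutions = rng.sample(range(L), m)
        seq_total += min(solutions) + 1                     # steps until the first hit
        par_total += min(s % block for s in solutions) + 1  # earliest hit in any block
    return seq_total / trials, par_total / trials

L, m, p = 100_000, 10, 8
seq_mean, par_mean = simulate(L, m, p)
print("sequential: simulated", round(seq_mean), "vs L/(m+1) =", round(L / (m + 1)))
print("parallel:   simulated", round(par_mean),
      "empirical speedup", round(seq_mean / par_mean, 2))
```

Under this model the empirical speedup stays close to p as long as m is much smaller than L, which matches the last line of the slide.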
Simulation Experiment
[Figure: predicted speedup (axis 0 to 200) versus number of processors (axis 0 to 200) for L = 1,000,000, with one curve each for m = 10, m = 100, m = 1,000, m = 10,000, m = 100,000]
Implementation
• test platform:
  – 32 node HPCVL Beowulf cluster
  – each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
  – gcc and LAM/MPI on LINUX Redhat 7.2
• code-s: Sequential k-vertex cover
• code-p: Parallel k-vertex cover
Test Data
• Protein sequences
• Same protein from several hundred species
• Each protein sequence a few hundred amino acid residues in length
• Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)
Test Data
• Somatostatin
– neuropeptide involved in the regulation of many functions in different organ systems
– Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255
Test Data
• WW
– small protein domain that binds proline rich sequences in other proteins and is involved in cellular signaling
– Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318
Test Data
• Kinase
– large family of enzymes involved in cellular regulation
– Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397
Test Data
• SH2 (src-homology domain 2)
– involved in targeting proteins to specific sites in cells by binding to phosphotyrosine
– Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397
Test Data
• Thrombin
– protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin
– Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413
Test Data
• PHD (pleckstrin homology domain)
– involved in cellular signaling
– Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603
Test Data
• Random Graph
|V| = 220, |E| = 2155, k = 122, k' = 122
• Grid Graph
|V| = 289, |E| = 544, k = 145, k' = 145
Clustal XP (in progress)
X: Extended
P: Parallel
[Diagram: Clustal W + Parallel Clustal, …, Parallel FPT MVC, Web Portal, combined into Clustal XP]