SnapDRAGON: protein 3D prediction-based; DOMAINATION: based on PSI-BLAST
DESCRIPTION
Two methods to predict domain boundary sequence positions from sequence information alone. SnapDRAGON: protein 3D prediction-based. DOMAINATION: based on PSI-BLAST. An example of two different bioinformatics approaches to the same problem.
![Page 1: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/1.jpg)
SnapDRAGON: protein 3D prediction-based
DOMAINATION: based on PSI-BLAST
Two methods to predict domain boundary sequence positions from
sequence information alone
An example of two different bioinformatics approaches to the same problem
![Page 2: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/2.jpg)
SnapDRAGON
Richard A. George
Jaap Heringa
George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.
Combining protein secondary and tertiary structure prediction to predict structural domains in sequence data
![Page 3: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/3.jpg)
Protein structure evolution
Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites
![Page 4: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/4.jpg)
Flavodoxin family - TOPS diagrams (Flores et al., 1994)
![Page 5: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/5.jpg)
Protein structure evolution
Insertion/deletion of structural domains can ‘easily’ be done at loop sites
![Page 6: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/6.jpg)
A domain is a:
• Compact, semi-independent unit (Richardson, 1981).
• Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary module (Bork, 1992).
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
![Page 7: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/7.jpg)
The DEATH Domain
• Present in a variety of eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
![Page 8: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/8.jpg)
Delineating domains is essential for:
• Obtaining high resolution structures (x-ray, NMR)
• Sequence analysis
• Multiple sequence alignment methods
• Prediction algorithms (SS, Class, secondary/tertiary structure)
• Fold recognition and threading
• Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method)
• Structural/functional genomics
• Cross genome comparative analysis
![Page 9: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/9.jpg)
Pyruvate kinase
Phosphotransferase
barrel regulatory domain
barrel catalytic substrate binding domain
nucleotide binding domain
1 continuous + 2 discontinuous domains
Structural domain organisation can be nasty…
![Page 10: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/10.jpg)
Protein structure hierarchical levels
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
PRIMARY STRUCTURE (amino acid sequence)
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE
![Page 14: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/14.jpg)
Domain prediction using DRAGON
Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994)
• Folds proteins based on the requirement that (conserved) hydrophobic residues cluster together.
• First constructs a random high-dimensional Cα distance matrix.
• Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.
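The distance geometry step can be illustrated with classical metric-matrix embedding. This is only a minimal sketch and not the DRAGON implementation: it assumes a complete, exact distance matrix, whereas DRAGON starts from a random high-dimensional matrix and works towards a target matrix. The names `embed_3d`, `distance`, `iters` and `seed` are hypothetical.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def embed_3d(dist, iters=500, seed=0):
    """Embed a full distance matrix into 3D (classical metric-matrix embedding)."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # Double centring turns squared distances into a Gram (metric) matrix.
    gram = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * 3 for _ in range(n)]
    for axis in range(3):
        # Power iteration finds the largest remaining eigenpair.
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n))
                  for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                gram[i][j] -= lam * v[i] * v[j]
    return coords


# Toy example: distances between five made-up points.
points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3), (1, 1, 1)]
target = [[distance(p, q) for q in points] for p in points]
model = embed_3d(target)
```

The recovered coordinates reproduce the input distances only up to rotation and reflection, which is all a distance matrix can determine.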
![Page 15: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/15.jpg)
The DRAGON target matrix is inferred from:
• A multiple sequence alignment of a protein (old)
– Conserved hydrophobicity
• Secondary structure information (SnapDRAGON)
– predicted by PREDATOR (Frishman & Argos, 1996).
– strands are entered as distance constraints from the N-terminal Cα to the C-terminal Cα
![Page 16: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/16.jpg)
• The Cα distance matrix is divided into smaller clusters.
• Separately, each cluster is embedded into a local centroid.
• The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.
[Figure: input data (multiple alignment, predicted secondary structure such as CCHHHCCEEE) give an N×N target matrix; 100 randomised initial N×N Cα distance matrices are embedded to N×3 coordinates, yielding 100 predictions]
![Page 17: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/17.jpg)
SnapDRAGON
[Figure: pipeline from the multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE), through folds generated by DRAGON and boundary recognition, to summed and smoothed boundaries]
![Page 18: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/18.jpg)
SnapDRAGON
1. Domains in structures assigned using method by Taylor (1997)
2. Domain boundary positions of each model against sequence
3. Summed and smoothed boundaries (biased window protocol)
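The summing and smoothing step can be sketched as follows. The slides do not spell out the biased window protocol, so this sketch substitutes a plain centred moving average; the function name `boundary_profile`, the 15-residue default window and the toy input are all assumptions.

```python
def boundary_profile(models, length, window=15):
    """Sum boundary positions over all models, then smooth the counts.

    models: one list of 0-based boundary positions per generated 3D model.
    Smoothing is a centred moving average (an assumed stand-in for the
    biased window protocol named on the slide).
    """
    counts = [0] * length
    for boundaries in models:
        for pos in boundaries:
            counts[pos] += 1
    half = window // 2
    smoothed = []
    for i in range(length):
        lo, hi = max(0, i - half), min(length, i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return counts, smoothed


# Toy input: four models, three of which agree on a boundary near residue 50.
counts, smoothed = boundary_profile([[50], [52], [48], [120]], length=200)
```

Positions where many models agree stand out as peaks in `smoothed`, which reflects the consistency idea behind using 100 predictions.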
![Page 19: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/19.jpg)
Prediction assessment
• Test set of 414 multiple alignments: 183 single and 231 multiple domain proteins.
– Sequence searches using PSI-BLAST (Altschul et al., 1997) followed by redundancy filtering using OBSTRUCT (Heringa et al., 1992) and alignment by PRALINE (Heringa, 1999)
• Boundary predictions are compared to the region of the protein connecting two domains (min 10 residues)
![Page 20: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/20.jpg)
Average prediction results per protein

| Method | Measure | Continuous set | Discontinuous set | Full set |
|---|---|---|---|---|
| SnapDRAGON | Coverage | 63.9 (± 43.0) | 35.4 (± 25.0) | 51.8 (± 39.1) |
| SnapDRAGON | Success | 46.8 (± 36.4) | 44.4 (± 33.9) | 45.8 (± 35.4) |
| Baseline 1 | Coverage | 43.6 (± 45.3) | 20.5 (± 27.1) | 34.7 (± 40.8) |
| Baseline 1 | Success | 34.3 (± 39.6) | 22.2 (± 29.5) | 29.6 (± 36.6) |
| Baseline 2 | Coverage | 45.3 (± 46.9) | 22.7 (± 27.3) | 35.7 (± 41.3) |
| Baseline 2 | Success | 37.1 (± 42.0) | 23.1 (± 29.6) | 31.2 (± 37.9) |

Coverage is the % of linkers predicted: TP/(TP+FN).
Success is the % of correct predictions made: TP/(TP+FP).
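The two measures can be computed per protein as in this minimal sketch. It assumes a prediction is correct when it falls inside any linker region, and that TP is counted per linker for coverage and per prediction for success; the function name `assess` and the toy data are hypothetical.

```python
def assess(predicted, linkers):
    """Return (coverage, success) in percent for one protein.

    predicted: predicted boundary positions.
    linkers:   (start, end) inclusive residue ranges connecting two domains.
    """
    # Linkers hit by at least one prediction (TP), the rest are FN.
    found = sum(any(s <= p <= e for p in predicted) for s, e in linkers)
    # Predictions landing inside some linker (TP), the rest are FP.
    correct = sum(any(s <= p <= e for s, e in linkers) for p in predicted)
    coverage = 100.0 * found / len(linkers) if linkers else 0.0
    success = 100.0 * correct / len(predicted) if predicted else 0.0
    return coverage, success


# Toy protein with two linkers and three predictions, one of them wrong.
coverage, success = assess([105, 107, 300], [(100, 110), (200, 212)])
```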
![Page 21: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/21.jpg)
SnapDRAGON
• Is very slow (can take hours for proteins > 400 aa) – cluster computing implementation
• Uses consistency in the absence of a standard of truth
• Goes from primary + secondary to tertiary structure to ‘just’ chop protein sequences
• SnapDRAGON webserver is underway
![Page 22: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/22.jpg)
DOMAINATION
Richard A. George
Protein domain identification and improved sequence searching using PSI-BLAST
(George & Heringa, Prot. Struct. Func. Genet., in press; 2002)
Integrating protein sequence database searching and domain recognition
![Page 23: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/23.jpg)
Domaination
• Current iterative homology search methods do not take into account that:
– Domains may have different ‘rates of evolution’.
– Common conserved domains, such as the tyrosine kinase domain, can obscure weak but relevant matches to other domain types
– Premature convergence (false negatives)
– Matrix migration / Profile wander (false positives).
![Page 24: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/24.jpg)
PSI-BLAST
• Query sequence is first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition (e.g. TM regions or coiled coils) likely to lead to spurious hits, which are excluded from alignment.
• Initially operates on a single query sequence by performing a gapped BLAST search.
• Then takes the significant local alignments found, constructs a ‘multiple alignment’ and abstracts a position-specific scoring matrix (PSSM) from this alignment.
• Rescans the database in a subsequent round to find more homologous sequences. Iteration continues until the user decides to stop or the search converges.
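To make the PSSM step concrete, here is a minimal sketch that turns a small gap-free alignment into log-odds scores. It is not PSI-BLAST's actual scheme, which uses sequence weighting and data-dependent pseudocounts; the uniform background, the fixed pseudocount and the names `build_pssm` and `score` are assumptions for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment of three equal-length sequences.
pssm = build_pssm(["ACDY", "ACEY", "ACDY"])
```

Residues that are conserved in a column score above zero and unobserved ones below zero, which is what lets the profile pick up more distant homologues in the next round.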
![Page 25: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/25.jpg)
PSI-BLAST iteration
[Figure: the query sequence Q is run through a gapped BLAST search; the database hits are used to build a PSSM (residues A, C, D, .., Y against positions), which drives the next gapped BLAST search and yields new database hits]
![Page 26: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/26.jpg)
DOMAINATION
Chop and Join Domains
![Page 27: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/27.jpg)
Post-processing low complexity
Remove local fragments with > 15% LC
![Page 28: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/28.jpg)
Identifying domain boundaries
Sum N- and C-termini of gapped local alignments
True N- and C-termini are counted twice (within 10 residues)
Boundaries are smoothed using two windows (15 residues long)
Combine scores using biased protocol:
$$S_i = \begin{cases} N_i + C_i & \text{if } N_i \times C_i = 0 \\ N_i + C_i + \dfrac{N_i \times C_i}{N_i + C_i} & \text{otherwise} \end{cases}$$
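The scoring rule above can be sketched in a few lines. The `combine` function follows the slide's formula directly; the smoothing is assumed to be a centred moving sum, the double counting of true termini is left out, and the names `smooth`, `combine` and `boundary_scores` are hypothetical.

```python
def smooth(values, window=15):
    """Centred moving sum over the given window (an assumed smoothing)."""
    half = window // 2
    n = len(values)
    return [sum(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def combine(n_score, c_score):
    """Biased protocol: reward positions where N- and C-termini coincide."""
    if n_score * c_score == 0:
        return n_score + c_score
    return n_score + c_score + (n_score * c_score) / (n_score + c_score)


def boundary_scores(alignments, length, window=15):
    """alignments: (start, end) 0-based inclusive query coordinates."""
    n_counts = [0] * length
    c_counts = [0] * length
    for start, end in alignments:
        n_counts[start] += 1  # N-terminus of the local alignment
        c_counts[end] += 1    # C-terminus of the local alignment
    return [combine(n, c)
            for n, c in zip(smooth(n_counts, window), smooth(c_counts, window))]


# Toy query of 200 residues with three local alignments.
scores = boundary_scores([(0, 99), (100, 199), (0, 101)], length=200)
```

The extra term is zero unless both kinds of termini pile up at the same place, so positions where one alignment ends and another begins score highest.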
![Page 29: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/29.jpg)
Identifying domain deletions
• Deletions in the query (or insertions in the DB sequences) are identified by:
– two adjacent segments in the query align to the same DB sequences (>70% overlap), which have a region of >35 residues not aligned to the query. (remove N- and C-termini)
[Figure: DB and Query sequences]
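A minimal sketch of the deletion test follows. It checks only the core condition, two neighbouring query segments hitting the same DB sequence with more than 35 unaligned DB residues between them. The 70% overlap criterion is not implemented, and the `Hit` record, `find_deletions` and the `max_query_gap` tolerance are assumptions.

```python
from collections import namedtuple

# Hypothetical record for one local alignment (1-based inclusive coordinates).
Hit = namedtuple("Hit", "db_id q_start q_end s_start s_end")


def find_deletions(hits, min_insert=35, max_query_gap=10):
    """Return (db_id, query_end, query_start, db_gap) for each deletion."""
    by_db = {}
    for hit in hits:
        by_db.setdefault(hit.db_id, []).append(hit)
    found = []
    for db_id, group in by_db.items():
        group.sort(key=lambda h: h.q_start)
        for left, right in zip(group, group[1:]):
            query_gap = right.q_start - left.q_end - 1
            db_gap = right.s_start - left.s_end - 1
            # Adjacent in the query, but far apart in the DB sequence.
            if query_gap <= max_query_gap and db_gap > min_insert:
                found.append((db_id, left.q_end, right.q_start, db_gap))
    return found


# Toy hits: P1 carries 60 extra residues between the two aligned parts.
hits = [
    Hit("P1", 1, 100, 1, 100), Hit("P1", 101, 200, 161, 260),
    Hit("P2", 1, 100, 5, 104), Hit("P2", 101, 200, 105, 204),
]
deletions = find_deletions(hits)
```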
![Page 30: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/30.jpg)
Identifying domain permutations
• A domain shuffling event is declared
– when two local alignments (>35 residues) within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order.
[Figure: DB order b a; Query order a b]
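The order comparison behind a shuffling event can be sketched as below. Only the length threshold and the reversed order are checked; the 70% overlap criterion is omitted, and the `Hit` record and `find_permutations` are hypothetical names.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical record for one local alignment (1-based inclusive coordinates).
Hit = namedtuple("Hit", "db_id q_start q_end s_start s_end")


def find_permutations(hits, min_length=35):
    """Return (db_id, hit_a, hit_b) where query and DB order disagree."""
    by_db = {}
    for hit in hits:
        if hit.q_end - hit.q_start + 1 > min_length:
            by_db.setdefault(hit.db_id, []).append(hit)
    shuffled = []
    for db_id, group in by_db.items():
        for a, b in combinations(group, 2):
            separate = a.q_end < b.q_start or b.q_end < a.q_start
            same_order = (a.q_start < b.q_start) == (a.s_start < b.s_start)
            if separate and not same_order:
                shuffled.append((db_id, a, b))
    return shuffled


# Toy hits: in P1 the two query segments appear in swapped order.
hits = [
    Hit("P1", 1, 80, 120, 199), Hit("P1", 100, 180, 1, 81),
    Hit("P2", 1, 80, 1, 80), Hit("P2", 100, 180, 100, 180),
]
events = find_permutations(hits)
```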
![Page 31: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/31.jpg)
Identifying continuous and discontinuous domains
• Each segment is assigned an independence score (In). If In > 10% the segment is assigned as a continuous domain.
• An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If the score is > 50% then the segments are considered discontinuous domains and joined.
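The slides do not give the formula for the association score, so the sketch below shows only one plausible reading: the share of database hits common to two segments, relative to the segment with fewer hits. That denominator, and the names `association` and `join_discontinuous`, are assumptions and not the published definition.

```python
def association(hits_a, hits_b):
    """Percentage of shared DB hits, relative to the smaller hit set (assumed)."""
    a, b = set(hits_a), set(hits_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def join_discontinuous(segment_hits, threshold=50.0):
    """segment_hits: {segment name: set of DB ids}, in sequence order."""
    names = list(segment_hits)
    joined = []
    for i in range(len(names)):
        for j in range(i + 2, len(names)):  # non-adjacent segments only
            if association(segment_hits[names[i]],
                           segment_hits[names[j]]) > threshold:
                joined.append((names[i], names[j]))
    return joined


# Toy data: segments s1 and s3 hit the same proteins, s2 hits others.
pairs = join_discontinuous({
    "s1": {"A", "B", "C", "D"},
    "s2": {"X", "Y"},
    "s3": {"A", "B", "C"},
})
```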
![Page 32: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/32.jpg)
Create domain profiles
• A representative set of the database sequence fragments that overlap a putative domain is selected for alignment using OBSTRUCT (Heringa et al. 1992), within a range of > 20% and < 60% sequence identity (including the query sequence).
• A multiple sequence alignment is generated using PRALINE (Heringa 1999).
• Each domain multiple alignment is used as a profile in further database searches using PSI-BLAST (Altschul et al 1997).
• The whole process is iterated until no new domains are identified.
![Page 33: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/33.jpg)
Domain boundary prediction accuracy
• Set of 452 multidomain proteins
• 56% of proteins were correctly predicted to have more than one domain
• 42% of predictions are within 20 residues of a true boundary
• 49.9% (44.6%) correct boundary predictions per protein
![Page 34: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/34.jpg)
• 23.3% of all linkers found in 452 multidomain proteins. Not a surprise since:
– Structural domain boundaries will not always coincide with sequence domain boundaries
– Proteins must have some domain shuffling
• For discontinuous proteins 34.2% of linkers were identified
• 30% of discontinuous domains were successfully joined
![Page 35: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/35.jpg)
Change in domain prediction accuracy using various PSI-BLAST E-value cut-offs
![Page 36: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/36.jpg)
Benchmarking versus PSI-BLAST
• A set of 452 non-homologous multidomain protein structures.
• Each protein was delineated into its structural domains. Database searches of the individual domains were used as a standard of truth.
• We then tested to what extent PSI-BLAST and DOMAINATION, when run on the full-length protein sequences, can capture the sequences found by the reference PSI-BLAST searches using the individual domains.
![Page 37: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/37.jpg)
Two sets based on individual domain searches:
• Reference set 1: consists of database sequences for which PSI-BLAST finds all domains contained in the corresponding full length query.
• Reference set 2: consists of database sequences found by searching with one or more of the domain sequences
• Therefore set 2 contains many more sequences than set 1
[Figure: query and database sequences (DB seqs) illustrating Ref set 1 and Ref set 2]
![Page 38: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/38.jpg)
Sequences found over Reference sets 1 and 2

| | PSI-BLAST vs Ref set 1 | DOMAINATION vs Ref set 1 | PSI-BLAST vs Ref set 2 | DOMAINATION vs Ref set 2 |
|---|---|---|---|---|
| Seq's found | 28581 | 28921 | 67300 | 73274 |
| Seq's missed | 618 | 278 | 13542 | 7568 |
| % missed | 2.12 | 0.95 | 16.8 | 9.36 |
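The "% missed" row follows from the two rows above it as missed / (found + missed). A short check, with the counts copied from the table and a hypothetical helper name `percent_missed`:

```python
def percent_missed(found, missed):
    """Share of reference sequences a search failed to retrieve, in percent."""
    return 100.0 * missed / (found + missed)


# (found, missed) pairs copied from the table above.
table = {
    "PSI-BLAST vs Ref set 1": (28581, 618),
    "DOMAINATION vs Ref set 1": (28921, 278),
    "PSI-BLAST vs Ref set 2": (67300, 13542),
    "DOMAINATION vs Ref set 2": (73274, 7568),
}

for name, (found, missed) in table.items():
    print(f"{name}: {percent_missed(found, missed):.2f}% missed")
```

Within each reference set, found plus missed gives the same total for both methods (29199 for set 1 and 80842 for set 2), so the percentages are directly comparable.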
![Page 39: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/39.jpg)
Reference 1
• PSI-BLAST finds 97.9% of sequences
• Domaination finds 99.1% of sequences
Reference 2
• PSI-BLAST finds 83.2% of sequences
• Domaination finds 90.6% of sequences
![Page 40: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/40.jpg)
Sequences found over Reference sets 1 and 2 from 15 SMART sequences

| | PSI-BLAST vs Ref set 1 | DOMAINATION vs Ref set 1 | PSI-BLAST vs Ref set 2 | DOMAINATION vs Ref set 2 |
|---|---|---|---|---|
| Seq's found | 323 | 347 | 3672 | 5902 |
| Seq's missed | 24 | 0 | 3438 | 1202 |
| % missed | 6.9 | 0 | 48.4 | 17.0 |
![Page 41: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/41.jpg)
SSEARCH significance test
• Verify the statistical significance of database sequences found by relating them to the original query sequence.
• SSEARCH (Pearson & Lipman 1988) calculates an E-value for each generated local alignment.
• This filter will lose distant homologies.
• Use the 452 proteins with known structure.
![Page 42: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/42.jpg)
Significant sequences found in database searches
At an E-value cut-off of 0.1 the performance of DOMAINATION
searches with the full-length proteins is 15% better than PSI-BLAST
![Page 43: SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST](https://reader035.vdocuments.net/reader035/viewer/2022062314/568139c8550346895da17647/html5/thumbnails/43.jpg)
Summary
Domains are recurring evolutionary units: by collecting the N- and C- termini of local alignments we can identify domain boundaries.
By finding domains we can significantly improve database search methods
SnapDRAGON is more sensitive than DOMAINATION but at high computational cost