sequence comparison – identification of remote homologues
![Page 1: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/1.jpg)
Sequence Comparison – Identification of remote homologues
Amir Harel, Moran Yassour
![Page 2: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/2.jpg)
Overview
Homologous proteins
Protein Sequence comparison
BLAST and its improvements
PSI-BLAST
![Page 3: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/3.jpg)
Homologous Proteins
Proteins that share a common ancestor are called homologous.
Homologous proteins share a common three-dimensional fold.
![Page 4: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/4.jpg)
Homologous Proteins
Homology refers to a similarity that spans an entire folding domain.
The difficulty in defining homology
![Page 5: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/5.jpg)
Why is homology important?
Prediction of a protein's properties
Classification of proteins into families
Evolutionary tree
![Page 6: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/6.jpg)
How to identify homology?
Using sequence similarities:
Aligning two proteins
Giving a score to the alignment
![Page 7: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/7.jpg)
Global & Local Alignments
Global alignment: alignment of the entire sequence
Local alignment: alignment of a segment of the sequence
![Page 8: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/8.jpg)
How to score an alignment
Substitution matrix: Sij = a value reflecting the probability that amino acid i mutated into amino acid j
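As a rough illustration of how such a score can be computed: PAM and BLOSUM entries are log-odds scores, comparing how often a pair is seen aligned with how often it would be expected by chance. The sketch below assumes that form; the function name, the `scale` parameter and the frequencies in the usage note are hypothetical, not values from the slides.

```python
import math

def log_odds_score(q_ij: float, p_i: float, p_j: float, scale: float = 2.0) -> int:
    """Rounded log-odds score in units of 1/scale bits.

    q_ij     : observed frequency of the aligned pair (i, j) in trusted alignments
    p_i, p_j : background frequencies of amino acids i and j
    scale    : assumed scaling factor (2.0 gives half-bit units)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))
```

With made-up frequencies, a pair seen twice as often as chance predicts (`log_odds_score(0.02, 0.1, 0.1)`) gets a positive score, and a pair seen half as often (`log_odds_score(0.005, 0.1, 0.1)`) gets a negative one.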
![Page 9: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/9.jpg)
Types of Substitution Matrices
PAM: derived from comparisons of closely related sequences
BLOSUM: derived from multiple alignments of distantly related sequences
![Page 10: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/10.jpg)
Substitution Matrices
Different matrices reflect different evolutionary distances:
1 PAM is the evolutionary distance at which, on average, 1 amino acid in 100 has been substituted.
BLOSUM X: sequences more than X% identical were clustered and counted as a single sequence.
![Page 11: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/11.jpg)
Gap costs
The most widely used gap score is -(a + bk) for a gap of length k, where a is the cost of opening a gap and b the cost of each gapped position.
Long gaps do not cost much more than short ones since a single mutation may cause a large gap.
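A minimal sketch of this affine gap score follows. The default values of `a` and `b` are illustrative assumptions, not parameters given in the slides.

```python
def gap_score(k: int, a: int = 11, b: int = 1) -> int:
    """Score of a gap of length k under the affine model -(a + b*k).

    a : gap-opening cost (assumed value)
    b : per-position extension cost (assumed value)
    """
    if k <= 0:
        return 0  # no gap, no penalty
    return -(a + b * k)
```

With these assumed values a gap of length 1 scores -12 and a gap of length 10 scores -21, so a tenfold longer gap costs less than twice as much.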
![Page 12: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/12.jpg)
Basic Sequence Comparison
Smith & Waterman (1981): dynamic programming for sequence comparison
O(mn) for sequences of lengths m and n
[Figure: m × n dynamic programming matrix]
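The dynamic programming idea can be sketched as below. This is a simplified version that assumes a flat match/mismatch score and a linear gap penalty instead of a substitution matrix and affine gaps; all parameter values are illustrative.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty)."""
    m, n = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best
```

Every one of the (m+1)(n+1) cells is filled exactly once, which is where the O(mn) time and space come from.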
![Page 13: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/13.jpg)
Complexity issue
When DBs become larger, m grows:
Time complexity
Space complexity
![Page 14: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/14.jpg)
Intuition to Solution
Go over less than the whole matrix.
Put the spotlight on segments that can be part of the best path, and extend them.
The best path is close to a diagonal.
Less than O(mn)
[Figure: m × n matrix with the region near the diagonal highlighted]
![Page 15: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/15.jpg)
Heuristic procedures
Heuristic: an algorithm that usually, but not always, works, or that gives nearly the right answer.
There is no guarantee of finding the best match.
![Page 16: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/16.jpg)
BLAST – Basic Local Alignment Search Tool
BLAST first scans the DB for words that score at least T when aligned with some word within the query sequence; these are called hits. The scan is O(n).
Each hit is extended in both directions for as long as the score does not drop too far below the best score seen so far.
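The two stages can be sketched roughly as follows. This is a toy version under stated assumptions: a flat match/mismatch score replaces the substitution matrix, every query word is compared directly with every subject word (real BLAST precomputes a lookup table so the scan stays linear), and the names `find_hits`, `extend_hit` and the drop-off parameter `X` are hypothetical.

```python
def word_score(u, v, match=5, mismatch=-4):
    """Score of two equal-length words under a toy match/mismatch scheme."""
    return sum(match if x == y else mismatch for x, y in zip(u, v))

def find_hits(query, subject, w=3, T=15):
    """All (i, j) where query[i:i+w] aligned with subject[j:j+w] scores at least T."""
    hits = []
    for i in range(len(query) - w + 1):
        for j in range(len(subject) - w + 1):
            if word_score(query[i:i + w], subject[j:j + w]) >= T:
                hits.append((i, j))
    return hits

def _extend(query, subject, qi, sj, step, X, match, mismatch):
    """Best score gain from extending along the diagonal in one direction."""
    gain = best_gain = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        if gain > best_gain:
            best_gain = gain
        elif best_gain - gain > X:
            break  # score dropped too far below the best seen so far
        qi += step
        sj += step
    return best_gain

def extend_hit(query, subject, i, j, w=3, X=10, match=5, mismatch=-4):
    """Score of the ungapped alignment grown from the hit at (i, j)."""
    seed = word_score(query[i:i + w], subject[j:j + w], match, mismatch)
    right = _extend(query, subject, i + w, j + w, +1, X, match, mismatch)
    left = _extend(query, subject, i - 1, j - 1, -1, X, match, mismatch)
    return seed + left + right
```

Lowering `T` in `find_hits` returns more hits, each of which then has to be extended.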
![Page 17: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/17.jpg)
[Figure: BLAST. Comparison matrix of query against database sequence, with word hits marked by x]
![Page 18: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/18.jpg)
A word about the parameter T
Small T: greater sensitivity, more hits to extend
Large T: lower sensitivity, fewer hits to extend
![Page 19: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/19.jpg)
Gapped BLAST
The original BLAST was ungapped
Soon after came gapped BLAST
![Page 20: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/20.jpg)
BLAST - Results
P value: the probability of an alignment with score S or better occurring by chance.
E value: expectation value. The number of different alignments with score S or better that are expected to occur in this DB search by chance.
Lower E value → more significant score.
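The two values are linked: if chance alignments are assumed to follow a Poisson distribution (the usual BLAST assumption, not stated on the slide), the probability of seeing at least one is P = 1 - e^(-E). A small sketch, with a hypothetical function name:

```python
import math

def p_value_from_e(e_value: float) -> float:
    """P(at least one chance alignment) = 1 - exp(-E), assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)
```

For small E the two are nearly equal (E = 0.01 gives P of about 0.00995), which is why they are often used interchangeably for significant hits.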
![Page 21: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/21.jpg)
E-value and Homology
A non-significant score does not necessarily imply non-homology:
![Page 22: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/22.jpg)
E-value and Homology
![Page 23: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/23.jpg)
Use it wisely
Choose your Substitution Matrix
Choose your DB
![Page 24: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/24.jpg)
Example 1 – remote homology
Frequently, identification of a remote homology will require several database searches.
The glutathione transferase family
![Page 25: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/25.jpg)
Remote homology
![Page 26: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/26.jpg)
Remote homology
Testing the possibility that elongation factors share homology with glutathione S-transferases:
There is a clear relationship between this elongation factor and the class-theta glutathione transferases.
![Page 27: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/27.jpg)
Example 2 - mapping
Three different families of G-protein coupled receptors:
the R family (the largest)
the C/S family
the G receptor family
![Page 28: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/28.jpg)
Finding links between families
| E-value | Score | Name |
|---|---|---|
| 0 | 2347 | OPSD_HUMAN RHODOPSIN. |
| 0 | 1791 | OPSG_CHICK GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO |
| 0 | 1002 | OPSG_HUMAN GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO |
| 3.10E-30 | 527 | OPS1_DROME OPSIN RH1 (OUTER R1-R6 PHOTORECEPTOR CE |
| 1.10E-23 | 435 | NK2R_MOUSE SUBSTANCE-K RECEPTOR (SKR) (NEUROKININ |
| 1.50E-23 | 431 | SSR5_HUMAN SOMATOSTATIN RECEPTOR TYPE 5. |
| 3.50E-22 | 419 | TXKR_HUMAN PUTATIVE TACHYKININ RECEPTOR. |
| 6.40E-14 | 283 | 5H7_HUMAN 5-HYDROXYTRYPTAMINE 7 RECEPTOR (5-HT-7) |
| 8.50E-14 | 280 | CKR1_HUMAN C-C CHEMOKINE RECEPTOR TYPE 1 (C-C CKR- |
| 1.50E-13 | 278 | ETBR_RAT ENDOTHELIN B RECEPTOR PRECURSOR (ET-B) (E |
| 1.60E-13 | 276 | AA2B_RAT ADENOSINE A2B RECEPTOR. |
| 0.007 | 133 | MAS_MOUSE MAS PROTO-ONCOGENE. |
| 0.007 | 130 | PAFR_MACMU PLATELET ACTIVATING FACTOR RECEPTOR (PA |
| 0.009 | 135 | OLF2_RAT OLFACTORY RECEPTOR-LIKE PROTEIN F12. |
| 0.01 | 131 | MAS_RAT MAS PROTO-ONCOGENE. |
| 0.01 | 130 | CAR1_DICDI CYCLIC AMP RECEPTOR 1 |
| 0.02 | 129 | OLF2_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR2. |
| 0.05 | 124 | CAR3_DICDI CYCLIC AMP RECEPTOR 3. |
| 0.06 | 120 | MAS_HUMAN MAS PROTO-ONCOGENE. |
| 0.17 | 117 | OLF1_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR1. |
| 0.23 | 121 | PER2_MOUSE PROSTAGLANDIN E RECEPTOR, EP2 SUBTYPE. |
![Page 29: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/29.jpg)
Finding links between families
| E-value | Score | Family | Name |
|---|---|---|---|
| 0 | 2678 | | CAR1_DICDI CYCLIC AMP RECEPTOR 1. |
| 0 | 1524 | | CAR3_DICDI CYCLIC AMP RECEPTOR 3. |
| 0 | 1497 | | CAR2_DICDI CYCLIC AMP RECEPTOR 2. |
| 0.00042 | 167 | C/S | CALR_HUMAN CALCITONIN RECEPTOR PRECURSOR (CT-R). |
| 0.00073 | 161 | R | IL8B_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B |
| 0.00087 | 162 | C/S | CLRA_RAT CALCITONIN RECEPTOR A PRECURSOR (CT-R-A) |
| 0.00095 | 162 | C/S | CLRB_RAT CALCITONIN RECEPTOR B PRECURSOR (CT-R-B) |
| 0.0045 | 150 | C/S | DIHR_MANSE DIURETIC HORMONE RECEPTOR (DH-R). |
| 0.012 | 145 | C/S | CALR_PIG CALCITONIN RECEPTOR PRECURSOR (CT-R). |
| 0.012 | 145 | C/S | GLR_RAT GLUCAGON RECEPTOR PRECURSOR (GL-R). |
| 0.016 | 141 | R | IL8B_RABIT HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B |
| 0.022 | 139 | R | RDC1_HUMAN G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG |
| 0.061 | 133 | R | G10D_RAT PROBABLE G PROTEIN-COUPLED RECEPTOR G10D |
| 0.085 | 130 | R | OPSD_HUMAN RHODOPSIN. |
| 0.098 | 131 | C/S | VIPR_HUMAN VASOACTIVE INTESTINAL POLYPEPTIDE RECEP |
| 0.11 | 129 | R | OPSD_SPHSP OPSIN. |
| 0.13 | 129 | C/S | SCRC_RAT SECRETIN RECEPTOR PRECURSOR. |
| 0.14 | 127 | R | IL8A_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A |
| 0.16 | 143.1 | C/S | GLPR_RAT GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO |
| 0.16 | 126 | R | AG2S_XENLA TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2 |
![Page 30: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/30.jpg)
Building Proteins tree
![Page 31: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/31.jpg)
Conclusions
Searching with high-scoring sequences, whether related or unrelated, is a very important tool.
Homology is a transitive relation…
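Transitivity is what lets intermediate sequences link two families: if A is homologous to B and B to C, then A and C are homologous too. One way to sketch this is to group sequences into connected components of pairwise homology calls. The function name and the input format are assumptions, and the sketch presumes all pairs refer to the same domain.

```python
def homology_groups(pairs):
    """Group sequence names into families from pairwise homology calls (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())
```

For example, the calls `("A", "B")` and `("B", "C")` place A and C in the same group even though they were never compared directly.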
![Page 32: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/32.jpg)
BLAST – Pros & Cons
Pros: It works
Cons:
Statistical rather than biological evaluation.
Convergent evolution.
Weak but biologically relevant similarities may be overlooked (PSI-BLAST improves on this).
![Page 33: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/33.jpg)
BLAST improvements
Running time improvements:
Two-hit method
Seed extension
PSI-BLAST
![Page 34: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/34.jpg)
The two-hit method
The extension step accounts for more than 90% of BLAST’s execution time
Invoke an extension only when two non-overlapping hits are found within a certain distance of one another
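A rough sketch of the trigger condition follows. It assumes, as in Altschul et al. (1997), that the two hits must lie on the same diagonal; the window size `A`, the word length `w` and the function name are illustrative assumptions.

```python
from collections import defaultdict

def two_hit_triggers(hits, w=3, A=40):
    """Return (first, second) hit pairs that would trigger an extension.

    hits : iterable of (i, j) word-hit positions (query offset, subject offset)
    w    : word length; hits closer than w on a diagonal overlap
    A    : maximum distance between the two hits (assumed value)
    """
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)  # hits with equal j - i share a diagonal

    triggers = []
    for d, starts in by_diagonal.items():
        starts.sort()
        last = None
        for i in starts:
            if last is not None:
                distance = i - last
                if distance < w:
                    continue  # overlaps the previous hit: ignore it
                if distance <= A:
                    triggers.append(((last, last + d), (i, i + d)))
            last = i
    return triggers
```

Isolated hits never reach the extension step, which is how the method cuts the time spent there.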
![Page 35: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/35.jpg)
[Figure: The two-hit method. Comparison matrix with word hits marked by x, labelled "first hit", "second hit" and "two-hit extension"]
![Page 36: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/36.jpg)
Seed Extension
![Page 37: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/37.jpg)
PSI-BLAST
Evolutionary pressure
Needle in a haystack
PSI-BLAST is designed to address this problem
![Page 38: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/38.jpg)
Evolution reveals itself
Give more weight to the conserved regions and ignore the background noise.
PSI-BLAST (Position-Specific Iterated BLAST) shifts our view to these regions using the Position-Specific Score Matrix (PSSM).
![Page 39: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/39.jpg)
Position-Specific Score Matrix (PSSM)
Pij = a value proportional to the probability of finding the ith amino acid in the jth position in these sequences
![Page 40: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/40.jpg)
PSSM
Represents the distribution of the amino acids in each position in a collection of sequences
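A minimal sketch of building such a matrix from a set of aligned, equal-length segments. It is deliberately simplified: it assumes a uniform background, a fixed pseudocount and no sequence weighting, whereas PSI-BLAST itself weights sequences and uses data-dependent pseudocounts. Function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Return one dict per alignment column mapping amino acid -> log2-odds score."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for j in range(length):
        counts = Counter(seq[j] for seq in aligned)  # column j of the alignment
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_with_pssm(pssm, segment):
    """Sum of the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

Here `pssm[j][aa]` plays the role of Pij, with the amino acid as the row and the position as the column. A fully conserved column gives its amino acid a high score and all others a low one.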
![Page 41: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/41.jpg)
Steps in PSI-BLAST
Initiation: run gapped BLAST on the query, producing a collection of matching sequences.
Iteration:
Construct the PSSM from the best sequences in this collection.
Compare the PSSM to the protein DB, again seeking alignments.
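The iteration can be sketched with a toy model. Everything here is a simplifying assumption: matching is ungapped, the first pass scores with a PSSM built from the query alone rather than running gapped BLAST, inclusion uses a raw-score `threshold` instead of an E-value, and the database and all names (`iterate_search`, `best_window`, `database`) are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(segments, pseudocount=1.0):
    """Log2-odds PSSM from equal-length segments (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for j in range(len(segments[0])):
        counts = Counter(s[j] for s in segments)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm

def best_window(pssm, seq):
    """Best-scoring ungapped window of seq against the PSSM: (score, segment)."""
    width = len(pssm)
    best = (float("-inf"), "")
    for start in range(len(seq) - width + 1):
        segment = seq[start:start + width]
        score = sum(col[aa] for col, aa in zip(pssm, segment))
        if score > best[0]:
            best = (score, segment)
    return best

def iterate_search(query, database, threshold, max_iterations=5):
    """Repeat: build PSSM from included segments, search, stop when nothing changes."""
    segments = [query]
    included = set()
    for _ in range(max_iterations):
        pssm = build_pssm(segments)
        found = {}
        for name, seq in database.items():
            score, segment = best_window(pssm, seq)
            if score >= threshold:
                found[name] = segment
        if set(found) == included:
            break  # converged: no change in the set of included sequences
        included = set(found)
        segments = [query] + list(found.values())
    return sorted(included)

# Toy database (hypothetical sequences)
database = {
    "close": "GGACDEFGG",
    "mid": "TTACDEYTT",
    "remote": "KKACWEYKK",
    "unrelated": "WWWWWWWWW",
}
```

In this toy data, a single pass with `threshold=3.0` finds only `close` and `mid`. Once `mid` contributes to the PSSM, `remote` scores above the threshold in the next iteration.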
![Page 42: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/42.jpg)
PSI-BLAST Example
We start with an uncharacterized protein – MJ0414
When submitting the query we set the E-value threshold to 0.01 (higher than usual)
![Page 43: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/43.jpg)
Result of initial gapped BLAST
![Page 44: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/44.jpg)
First iteration –
Iterating the search using the derived profile uncovers DNA ligase II with an E-value of 0.005
![Page 45: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/45.jpg)
Second iteration –
![Page 46: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/46.jpg)
Interpretation of the results
Including a high-scoring but unrelated protein will shift the PSSM toward it.
E-values retrieved in later iterations should not be taken as automatic proof of homology
![Page 47: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/47.jpg)
Was the ligase a right choice?
![Page 48: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/48.jpg)
PSI-BLAST Conclusions
Uncovers protein relationships missed by single-pass database-search methods
Errors are easily amplified by iterations.
PSI-BLAST increases rather than removes the need for expertise, because there is more to interpret
![Page 49: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/49.jpg)
Running time evaluation
Running time can be highly influenced by modifying parameters
| Program | Normalized running time |
|---|---|
| Smith-Waterman | 36 |
| Original BLAST | 1.0 |
| Gapped BLAST | 0.34 |
| PSI-BLAST | 0.87 |
![Page 50: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/50.jpg)
Future Improvements
Accepting PSSM as input from other programs
Realignment – improve the alignment before going over the DB
Automatic domain recognition
![Page 51: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/51.jpg)
Summary
With BLAST, run multiple searches to learn as much as possible.
The BLAST improvements are considerably faster and significantly enhance DB searching.
For many queries, PSI-BLAST can greatly increase sensitivity to weak but biologically relevant sequence relationships.
![Page 52: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/52.jpg)
Questions time
Thank You
![Page 53: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/53.jpg)
References
Pearson WR. (1997) Identifying distantly related protein sequences. Comput Appl Biosci., 13, 325-332
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402
Altschul SF, Koonin EV. (1998) Iterated profile searches with PSI-BLAST – a tool for discovery in protein databases. Trends Biochem Sci., 23, 444-447
![Page 54: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/54.jpg)
Sites
http://www.ncbi.nlm.nih.gov/BLAST
http://www.cs.huji.ac.il/~cbio
http://www.people.virginia.edu/~wrp/
http://www-lmmb.ncifcrf.gov/
![Page 55: Sequence Comparison – Identification of remote homologues](https://reader036.vdocuments.net/reader036/viewer/2022062423/56814c1c550346895db91cbb/html5/thumbnails/55.jpg)
Appendix - Statistics
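$$S' = \frac{\lambda S - \ln k}{\ln 2}$$

$$S' = \log_2 \frac{N}{E}$$

$$N = nm$$

$$E = \frac{N}{2^{S'}}$$

Here S is the raw alignment score, S' the normalized score, N the size of the search space for sequences of lengths n and m, and E the E-value.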
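The normalized score and E-value of Altschul et al. (1997) can be computed as in the sketch below. The statistical parameters `lam` (λ) and `k` depend on the scoring system and must be supplied; the function names are hypothetical.

```python
import math

def normalized_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized (bit) score S' = (lam * S - ln k) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits: float, m: int, n: int) -> float:
    """Expected number of chance alignments, E = N / 2**S' with N = n * m."""
    return (m * n) / 2 ** bits
```

Each additional bit of normalized score halves the E-value, and doubling the database size doubles it.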