sequence similarity
Sequence Similarity
PROBCONS: Probabilistic Consistency-based Multiple
Alignment of Proteins
[Figure: pair-HMM state diagram with three states. MATCH emits an aligned pair (xi, yj); INSERT X emits (xi, -); INSERT Y emits (-, yj).]
A pair-HMM model of pairwise alignment
Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
Transition probabilities ~ gap penalties
Emission probabilities ~ substitution matrix (from BLOSUM)
Example alignment of two sequences x and y:
x: ABRACA-DABRA
y: AB-ACARDI---
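As a rough illustration of how transition and emission probabilities combine into P(A), the sketch below scores one fixed alignment under a three-state pair-HMM. The parameter names `delta` (gap-open probability) and `epsilon` (gap-extension probability), the caller-supplied `match_prob` function and `background` table, and the choice to start as if coming from MATCH are all assumptions made for this example, not values from the slides.

```python
import math

def alignment_log_prob(row_x, row_y, delta, epsilon, match_prob, background):
    """Log-probability of one gapped alignment under a toy pair-HMM.

    delta      : probability of moving from MATCH into INSERT X (or INSERT Y)
    epsilon    : probability of staying in an insert state
    match_prob : function (a, b) -> joint emission probability in MATCH
    background : dict a -> emission probability in the insert states
    All four are hypothetical parameters chosen by the caller.
    """
    trans = {
        ("M", "M"): 1 - 2 * delta, ("M", "X"): delta, ("M", "Y"): delta,
        ("X", "X"): epsilon, ("X", "M"): 1 - epsilon,
        ("Y", "Y"): epsilon, ("Y", "M"): 1 - epsilon,
    }
    state = "M"  # assumption: the chain starts as if coming from MATCH
    logp = 0.0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            continue  # an all-gap column emits nothing
        if a != "-" and b != "-":
            nxt, emit = "M", match_prob(a, b)
        elif b == "-":
            nxt, emit = "X", background[a]
        else:
            nxt, emit = "Y", background[b]
        if (state, nxt) not in trans:
            raise ValueError("this toy model has no direct X <-> Y transition")
        logp += math.log(trans[(state, nxt)]) + math.log(emit)
        state = nxt
    return logp
```

Larger `delta` and `epsilon` make gapped columns cheaper, which is the sense in which transition probabilities play the role of gap penalties.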
Computing Pairwise Alignments
• The conditional distribution P(α | x, y) reflects the model's uncertainty over the "correct" alignment of x and y
• The Viterbi algorithm identifies the highest-probability alignment, α_viterbi, in O(L²) time
• Caveat: the most likely alignment is not necessarily the most accurate
• Alternative: find the alignment of maximum expected accuracy
[Figure: the distributions P(α) and P(α | x, y), with α_viterbi marked.]
The Lazy-Teacher Analogy
• 10 students take a 10-question true-false quiz
• How do you make the answer key?
  Approach #1: Use the answer sheet of the best student!
  Approach #2: Weighted majority vote!
[Figure: the students' graded answer sheets (grades from A down to C), each showing that student's answer to question 4; most answer F, a few answer T.]
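A minimal sketch of Approach #2, assuming each student comes with a numeric reliability weight (the slides do not say how weights would be chosen, so `weights` is a hypothetical input). Each question is decided independently by comparing the total weight behind True with the total weight behind False.

```python
def weighted_majority_key(answer_sheets, weights):
    """Build an answer key by a weighted vote on every question.

    answer_sheets : list of per-student answer lists (True/False values)
    weights       : one non-negative weight per student (hypothetical input)
    Ties are resolved as False.
    """
    n_questions = len(answer_sheets[0])
    key = []
    for q in range(n_questions):
        weight_true = sum(w for sheet, w in zip(answer_sheets, weights) if sheet[q])
        weight_false = sum(w for sheet, w in zip(answer_sheets, weights) if not sheet[q])
        key.append(weight_true > weight_false)
    return key


# The best student (weight 0.5) answers True on question 1,
# but the two weaker students together outweigh that answer.
sheets = [[True, False], [False, False], [False, True]]
key = weighted_majority_key(sheets, [0.5, 0.3, 0.3])
```

Here `key` differs from the best student's own sheet on the first question, which is the point of the analogy: the answer key need not coincide with any single student's sheet.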
Viterbi vs. Maximum Expected Accuracy (MEA)
Viterbi
• picks single alignment with highest chance of being completely correct
• mathematically, finds the alignment α that maximizes
Eα*[1{α = α*}]
Maximum Expected Accuracy
• picks alignment with highest expected number of correct predictions
• mathematically, finds the alignment α that maximizes
Eα*[accuracy(α, α*)]
Computing MEA alignments
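• Define
  accuracy(α, α*) = (# of correct predicted matches) / (length of shorter sequence)
• The denominator is a constant for given x and y, so
$$
\begin{aligned}
E_{\alpha^*}[\mathrm{accuracy}(\alpha, \alpha^*) \mid x, y]
  &\propto E_{\alpha^*}\Big[\sum_{(x_i, y_j) \in \alpha} \mathbf{1}\{(x_i, y_j) \in \alpha^*\} \,\Big|\, x, y\Big] \\
  &= \sum_{\alpha'} P(\alpha' \mid x, y) \sum_{(x_i, y_j) \in \alpha} \mathbf{1}\{(x_i, y_j) \in \alpha'\} \\
  &= \sum_{(x_i, y_j) \in \alpha} \sum_{\alpha'} P(\alpha' \mid x, y)\, \mathbf{1}\{(x_i, y_j) \in \alpha'\} \\
  &= \sum_{(x_i, y_j) \in \alpha} P\big((x_i, y_j) \in \alpha' \mid x, y\big)
\end{aligned}
$$
• Define M[i, j] = posterior probability that xi is aligned to yj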
Computing MEA alignments
• Define
  accuracy(α, α*) = (# of correct predicted matches) / (length of shorter sequence)
• M[i, j] = P(xi is aligned to yj | x, y), the posterior probability that xi is aligned to yj
  Can compute with forward and backward dynamic programming in O(L²) time
• Then, the MEA alignment is the highest-summing path through the matrix M
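The "highest-summing path" can be found with a Needleman-Wunsch style recursion that has no gap penalties. The sketch below assumes the posterior matrix `M` has already been computed (for instance by forward-backward) and is passed in as a list of rows; the function name and return format are choices made for this example.

```python
def mea_alignment(M):
    """Return (score, pairs) for the highest-summing monotone path through M.

    M[i][j] is taken to be the posterior probability that x_i aligns to y_j
    (0-based indices). pairs lists the matched (i, j) positions in order.
    """
    n = len(M)
    m = len(M[0]) if n else 0
    # S[i][j] = best total posterior using the first i letters of x and j of y
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + M[i - 1][j - 1],  # match x_i with y_j
                          S[i - 1][j],                        # leave x_i unmatched
                          S[i][j - 1])                        # leave y_j unmatched
    # Trace back from the corner to recover the matched pairs
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if S[i][j] == S[i - 1][j - 1] + M[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return S[n][m], pairs
```

Like the forward-backward step, this recursion fills an L × L table, so it also runs in O(L²) time.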
The consistency signal
[Figure: three sequences x, y and z, with positions xi, yj, yj' and zk marked to illustrate the consistency signal.]
To estimate P(xi ~ yj | x, y, z)
Method 1: triplet-HMM
P(xi ~ yj | x, y, z) = ∑k P(xi ~ yj ~ zk | x, y, z)
Parameters trained with unsupervised EM
Running time: O(N³L³), where N = # of sequences and L = sequence length
Probabilistic consistency
• Re-estimate P(xi is aligned to yj | x, y) as P(xi is aligned to yj | x, y, z)
• 2 approaches:
  1) Exact: triplet HMM, O(L³) time
  2) Approximate: use independence assumptions
∑k P(xi ~ zk and zk ~ yj | x, y, z)
  = ∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk)
  ≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)   (assuming independence)
Probabilistic consistency
• Compute P(xi is aligned to yj | x, y, z)
• To compute it, use P(xi ~ yj | x, y, z) ≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)
• Notice that for any given i, most entries for k and j will be close to 0, so the matrices are sparse
• In matrix form: $P_{xy|z} \approx P_{xz} P_{zy}$
• Finally, let $P_{xy|S} = \frac{1}{|S|} \sum_{z \in S} P_{xz} P_{zy}$
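The averaged matrix product can be sketched in a few lines. The version below uses dense Python lists for clarity, whereas the slide points out that the real matrices are sparse; the function names and the convention of passing one `(P_xz, P_zy)` pair per intermediate sequence z are assumptions made for this example.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def consistency_transform(pairs):
    """Average P_xz @ P_zy over all intermediate sequences z.

    pairs : list of (P_xz, P_zy) tuples, one per sequence z in S, where
            P_xz[i][k] ~ P(x_i ~ z_k | x, z) and P_zy[k][j] ~ P(z_k ~ y_j | z, y).
    Returns the re-estimated matrix with one row per position of x
    and one column per position of y.
    """
    if not pairs:
        raise ValueError("need at least one intermediate sequence")
    rows, cols = len(pairs[0][0]), len(pairs[0][1][0])
    total = [[0.0] * cols for _ in range(rows)]
    for p_xz, p_zy in pairs:
        product = matmul(p_xz, p_zy)
        for i in range(rows):
            for j in range(cols):
                total[i][j] += product[i][j]
    return [[value / len(pairs) for value in row] for row in total]
```

A sparse representation (for example, storing only entries above a small threshold) would follow the same structure while skipping the near-zero terms.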
Multiple sequence alignment
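• A straightforward generalization: sum-of-pairs scoring, tree-based progressive alignment, iterative refinement

Example: a multiple alignment of three sequences and the three pairwise alignments it induces

ABRACA-DABRA
AB-ACARDI---
ABRA---DABI-

AB-ACARDI---     ABRACADABRA     ABRACA-DABRA
ABRA---DABI-     ABRA--DABI-     AB-ACARDI---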
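[Figure: progressive alignment and iterative refinement on the sequences ABRACADABRA, ABACARDI and ABRADABI.]

Progressive alignment: ABACARDI and ABRACADABRA are aligned first, then ABRADABI is added:

ABRACA-DABRA
AB-ACARDI---
ABRA---DABI-

Iterative refinement: ABACARDI is taken out and realigned to the alignment of the other two sequences:

ABRACA-DABRA
AB-ACARD--I-
ABRA---DABI-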
Summary of PROBCONS Algorithm
Given K sequences to be aligned,
(1) Compute M[i, j] for all pairs of sequences, x and y
(2) Use probabilistic consistency to reestimate M[i, j]
(3) Build a tree of the sequences by connecting closest first
    • “Closest” defined according to expected accuracy
    • EA(x, y) = E(accuracy) of MEA alignment of x and y
(4) Perform progressive alignment along the tree
    • Score of a column: sum-of-pairs M[i, j]
(5) Apply iterative refinement
Training/testing methodology
• 3 reference benchmark sets: BAliBASE, PREFAB, SABmark
• PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE.
• Quality score:
  Q(α, α*) = (# of correct predicted matches) / (total # of true matches)
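For a single pair of sequences, the quality score can be computed by converting each gapped alignment into its set of matched residue positions and intersecting the two sets. This is a sketch for the pairwise case only, assuming alignments are given as two equal-length gapped strings with "-" as the gap character; the helper names are made up for the example.

```python
def aligned_pairs(row_x, row_y):
    """Set of (i, j) residue index pairs matched by a gapped pairwise alignment."""
    pairs = set()
    i = j = 0  # next unread residue index in x and in y
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def q_score(predicted, reference):
    """Fraction of the reference matches that the predicted alignment recovers."""
    true_matches = aligned_pairs(*reference)
    if not true_matches:
        return 0.0
    correct = aligned_pairs(*predicted) & true_matches
    return len(correct) / len(true_matches)


reference = ("ABRACA-DABRA", "AB-ACARDI---")
predicted = ("ABRACADABRA", "ABACARDI---")  # ungapped placement of y under x
q = q_score(predicted, reference)
```

For a multiple alignment, the same counts would be accumulated over every pair of sequences before dividing.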
Evaluation of Algorithm Components
Alignment type       Algorithm                          Quality (74)   Time (sec)
all-pairs pairwise   Viterbi                            0.375          0.72
all-pairs pairwise   MEA                                0.403          1.6
all-pairs pairwise   PC (O(L³))                         0.431          584.2
all-pairs pairwise   PC x 1 (O(L²))                     0.422          1.7
all-pairs pairwise   PC x 2 (O(L²))                     0.427          1.9
multiple             Progressive PC x 2 (O(L²))         0.432          1.9
multiple             Progressive PC x 2 (O(L²)) + IR    0.435          3.3
Performance of different alignment tools
Algorithm    BAliBASE (237)     PREFAB (1932)        SABmark (698)
             Q       t          Q       t            Q       t
Align-m      0.804   19:25      -       -            0.352   56:44
DIALIGN      0.832   2:53       0.572   12:25:00     0.410   8:28
CLUSTALW     0.861   1:07       0.589   2:57:00      0.439   2:16
MAFFT        0.882   1:18       0.648   2:36:00      0.442   7:33
T-Coffee     0.883   21:31      0.636   144:51:00    0.456   59:10
MUSCLE       0.896   1:05       0.648   3:11:00      0.464   20:42
PROBCONS     0.910   5:32       0.668   19:41:00     0.505   17:20
Resources for alignment
Protein Multiple Aligners
• CLUSTALW – most widely used (1994): http://www.ebi.ac.uk/clustalw/
• MUSCLE – most scalable (2004): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
• PROBCONS – most accurate (2004): http://probcons.stanford.edu/
Some more protein multiple aligners:
MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc.
ProbCons: Chuong (Tom) Do
Profile hidden Markov models for sequence families
PFAM
Protein FAMilies database of alignments
• Profile HMMs describe each family
• For each family in Pfam you can:
    Look at multiple alignments
    View protein domain architectures
    Examine species distribution
    Follow links to other databases
    View known protein structures
PFAM
Pfam-A – curated multiple alignments Grows slowly; quality controlled by experts
Pfam-B – automatic clustering (ProDom derived) New sequences instantly incorporated; unchecked
• Search by: Sequence, keyword, domain, taxonomy
• Browsing by family or genome
• Evolutionary tree
• Source of seed alignments: Pfam-B families, published articles, 'domain hunting' studies
Profile HMMs
• Each M state has a position-specific pre-computed substitution table
• Each I state has position-specific gap penalties (and in principle can have its own emission distributions)
• Each D state also has position-specific gap penalties
• In principle, D-D transitions can also be customized per position
[Figure: profile HMM for protein family F, running from BEGIN to END through match states M1, M2, …, Mm, insert states I0, I1, …, Im-1, Im and delete states D1, D2, …, Dm-1, Dm.]
Profile HMMs
transition between match states – αM(i)M(i+1)
transitions between match and insert states – αM(i)I(i), αI(i)M(i+1)
transition within insert state – αI(i)I(i)
transition between match and delete states – αM(i)D(i+1), αD(i)M(i+1)
transition within delete state – αD(i)D(i+1)
emission of amino acid b at a state S – εS(b)
Profile HMMs
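• transition probabilities ~ frequency of a transition in the alignment
• emission probabilities ~ frequency of an emission in the alignment
• pseudocounts are usually introduced

$$
a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}
$$

where A_kl is the number of transitions from state k to state l and E_k(a) is the number of emissions of amino acid a from state k, counted in the alignment (with any pseudocounts added).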
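To make the frequency estimate concrete, the sketch below estimates the emission distribution of one match state from a single alignment column. It assumes the simplest pseudocount scheme (the same constant added to every symbol); the slides only say that pseudocounts are usually introduced, so the scheme and the function name are illustrative choices.

```python
from collections import Counter

def estimate_emissions(column, alphabet, pseudocount=1.0):
    """Emission probabilities for one match state from one alignment column.

    column      : the residues seen in that column, with '-' for gaps
    alphabet    : the symbols the state is allowed to emit
    pseudocount : constant added to every count (an assumed, simple scheme)
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts[a] + pseudocount for a in alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


# Column with two A's, one C and one gap, over a toy 4-letter alphabet.
probs = estimate_emissions("AAC-", "ACGT")
```

Without the pseudocount, symbols that never appear in the column (G and T here) would get probability zero, and any sequence containing them at that position would be scored as impossible. Transition probabilities can be estimated the same way from transition counts.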
Alignment of a protein to a profile HMM
To align sequence x1…xn to a profile HMM:
We will find the most likely alignment with the Viterbi DP algorithm
• Define
  V_j^M(i): score of the best alignment of x1…xi to the HMM ending in xi being emitted from Mj
  V_j^I(i): score of the best alignment of x1…xi to the HMM ending in xi being emitted from Ij
  V_j^D(i): score of the best alignment of x1…xi to the HMM ending in Dj (xi is the last character emitted before Dj)
• Denote by q_a the frequency of amino acid a in a 'random' protein
Alignment of a protein to a profile HMM
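$$
V_j^M(i) = \log\frac{\varepsilon_{M(j)}(x_i)}{q_{x_i}} + \max
\begin{cases}
V_{j-1}^M(i-1) + \log \alpha_{M(j-1)M(j)} \\
V_{j-1}^I(i-1) + \log \alpha_{I(j-1)M(j)} \\
V_{j-1}^D(i-1) + \log \alpha_{D(j-1)M(j)}
\end{cases}
$$

$$
V_j^I(i) = \log\frac{\varepsilon_{I(j)}(x_i)}{q_{x_i}} + \max
\begin{cases}
V_j^M(i-1) + \log \alpha_{M(j)I(j)} \\
V_j^I(i-1) + \log \alpha_{I(j)I(j)} \\
V_j^D(i-1) + \log \alpha_{D(j)I(j)}
\end{cases}
$$

$$
V_j^D(i) = \max
\begin{cases}
V_{j-1}^M(i) + \log \alpha_{M(j-1)D(j)} \\
V_{j-1}^I(i) + \log \alpha_{I(j-1)D(j)} \\
V_{j-1}^D(i) + \log \alpha_{D(j-1)D(j)}
\end{cases}
$$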
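The three recurrences translate almost line by line into code. The following is a sketch under several assumptions that the slides do not spell out: BEGIN is treated as M_0 with score 0, the transition into END is stored as a transition from position m to "M", transitions are looked up in a dictionary keyed by `(source type, source position, target type)`, and missing transitions have probability 0. The `toy_model` helper and its numbers are invented purely to exercise the function.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    """log(p), with log(0) mapped to -infinity instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def profile_viterbi(x, m, match_emit, insert_emit, trans, q):
    """Best log-odds score of sequence x against a profile HMM with m match states."""
    n = len(x)
    VM = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VD = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VM[0][0] = 0.0  # assumption: BEGIN plays the role of M_0

    def lt(src, j, dst):
        return _log(trans.get((src, j, dst), 0.0))

    for i in range(n + 1):
        for j in range(m + 1):
            if i >= 1 and j >= 1:  # x_i emitted from M_j
                odds = _log(match_emit[j][x[i - 1]] / q[x[i - 1]])
                VM[j][i] = odds + max(VM[j - 1][i - 1] + lt("M", j - 1, "M"),
                                      VI[j - 1][i - 1] + lt("I", j - 1, "M"),
                                      VD[j - 1][i - 1] + lt("D", j - 1, "M"))
            if i >= 1:             # x_i emitted from I_j
                odds = _log(insert_emit[j][x[i - 1]] / q[x[i - 1]])
                VI[j][i] = odds + max(VM[j][i - 1] + lt("M", j, "I"),
                                      VI[j][i - 1] + lt("I", j, "I"),
                                      VD[j][i - 1] + lt("D", j, "I"))
            if j >= 1:             # D_j emits nothing
                VD[j][i] = max(VM[j - 1][i] + lt("M", j - 1, "D"),
                               VI[j - 1][i] + lt("I", j - 1, "D"),
                               VD[j - 1][i] + lt("D", j - 1, "D"))
    # assumption: END is reached by the "M" transition out of position m
    return max(VM[m][n] + lt("M", m, "M"),
               VI[m][n] + lt("I", m, "M"),
               VD[m][n] + lt("D", m, "M"))

def toy_model():
    """Hypothetical 2-position profile over the alphabet {A, C} that prefers A."""
    m = 2
    q = {"A": 0.5, "C": 0.5}
    match_emit = [None] + [{"A": 0.9, "C": 0.1} for _ in range(m)]
    insert_emit = [dict(q) for _ in range(m + 1)]
    trans = {}
    for j in range(m + 1):
        for src in "MID":
            if j < m:
                trans[(src, j, "M")] = 0.8
                trans[(src, j, "I")] = 0.1
                trans[(src, j, "D")] = 0.1
            else:
                trans[(src, j, "M")] = 0.9  # transition into END
                trans[(src, j, "I")] = 0.1
    return m, match_emit, insert_emit, trans, q
```

With this toy model, `profile_viterbi("AA", *toy_model())` scores higher than `profile_viterbi("CC", *toy_model())`, as expected for a profile whose match states favor A. Only the score is returned; recovering the alignment itself would need back-pointers.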
Weight of each sequence
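• One simple weighting scheme is to find how much edge length each leaf contributes
  Example: edge 1 belongs to a
  Example: edge 3 belongs both to a and to b: e3·e1/(e1+e2) goes to a
• Moving up the tree, the current edge adds to the weight of each leaf i below it:
$$
\Delta w_i = e_{\text{current}} \cdot \frac{w_i}{\sum_{\text{leaves } k \text{ below } e_{\text{current}}} w_k}
$$
[Figure: example tree with nodes labelled a to i and edges numbered 1, 2, 3.]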
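The update rule can be applied bottom-up over a rooted tree. The sketch below assumes a simple nested-tuple tree encoding invented for this example: a leaf is `(edge_length, name)` and an internal node is `(edge_length, [children])`, where `edge_length` is the length of the edge above that node (0 for the root).

```python
def leaf_weights(tree):
    """Sequence weights from edge lengths, shared among the leaves below each edge.

    tree : (edge_length, name) for a leaf, or (edge_length, [subtrees]) for an
           internal node. This encoding is an assumption made for the sketch.
    """
    length, payload = tree
    if isinstance(payload, str):
        # A leaf starts with the full length of its own edge.
        return {payload: length}
    weights = {}
    for child in payload:
        weights.update(leaf_weights(child))
    total = sum(weights.values())
    if total == 0:
        # Degenerate case: split the edge equally among the leaves below it.
        share = length / len(weights)
        return {leaf: w + share for leaf, w in weights.items()}
    # Delta w_i = e_current * w_i / (sum of w_k over leaves k below e_current)
    return {leaf: w + length * w / total for leaf, w in weights.items()}


# Leaves a and b hang on edges of length 1 and 2 below an edge of length 3;
# leaf c hangs directly off the root on an edge of length 4.
example = (0.0, [(3.0, [(1.0, "a"), (2.0, "b")]), (4.0, "c")])
weights = leaf_weights(example)
```

In this example a receives 1 + 3·1/(1+2) = 2 and b receives 2 + 3·2/(1+2) = 4, matching the e3·e1/(e1+e2) rule on the slide, and the weights sum to the total edge length of the tree.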
How to build a profile HMM
Resources on the web
• HMMer – a free profile HMM software http://hmmer.wustl.edu/
• SAM – another free profile HMM software http://www.cse.ucsc.edu/research/compbio/sam.html
• PFAM – database of alignments and HMMs for protein families and domains http://www.sanger.ac.uk/Software/Pfam/
• SCOP – a structural classification of proteins http://scop.berkeley.edu/data/scop.b.html