Eric C. Rouchka, University of Louisville
SATCHMO: sequence alignment and tree construction using hidden Markov models
Edgar, R.C. and Sjölander, K. (2003) Bioinformatics 19(11):1404-1411.
CECS 694-04 Bioinformatics Journal Club
Eric Rouchka, D.Sc.
September 10, 2003
What is Multiple Sequence Alignment (MSA)?
• Taking three or more sequences and aligning them so that similar residues fall in the same columns
Globin Example
>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKL
HVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
HVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDK
LHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLH
VDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKL
HVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATK
HKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQKVADALSLAVERLDDLPHALSALSHLHACQLRVDPAS
FQLLGHCLLVTLARHYPGDFSPALQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKL
LSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
Why do MSA?
• Homology Searching
  – Important regions conserved across (or within) species
    • Genic Regions
    • Regulatory Elements
• Phylogenetic Classification
• Subfamily classification
• Identification of critical residues
MSA Approaches
• All columns alignable across all sequences
  – MSA
  – ClustalW
• Columns alignable throughout all sequences singled out (Profile HMM)
  – HMMER
  – SAM
MSA
• N-dimensional dynamic programming
• Time consuming
• High memory usage
• Guaranteed to yield the optimal (maximum-scoring) alignment
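To make the time and memory cost concrete, here is a minimal sketch that counts the cells in a full N-dimensional dynamic programming table and the predecessor cells each one consults. The function names and the sequence length of 100 are illustrative assumptions, not taken from the MSA program.

```python
from math import prod

def dp_cells(lengths):
    """Cells in a full N-dimensional DP table (one extra index per axis for the empty prefix)."""
    return prod(n + 1 for n in lengths)

def predecessors_per_cell(num_sequences):
    """Each cell considers every non-empty subset of sequences advancing by one residue."""
    return 2 ** num_sequences - 1

# Table size grows exponentially with the number of sequences.
for n in (2, 3, 5):
    lengths = [100] * n  # hypothetical sequences of length 100
    print(n, dp_cells(lengths), predecessors_per_cell(n))
```

Both quantities grow exponentially in the number of sequences, which is why exact N-dimensional alignment is practical only for a handful of short sequences.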
ClustalW
• Progressive Alignment
  – Sequences aligned in pair-wise fashion
  – Alignment scores produce phylogenetic tree
  – Enhanced dynamic programming approach
Hidden Markov Models
• Match State, Insert State, Delete State
HMMs
• Models conserved regions
• Successful at detecting and aligning critical motifs and conserved core structure
• Difficulty in aligning sequence outside of these regions
SATCHMO
• Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
www.lib.jmu.edu/music/composers/armstrong.htm
SATCHMO
• Progressive Alignment
  – Built iteratively in pairs
  – Profile HMMs used
• Alignments of the same sequences are not the same at each node
• Number of columns predicted to be alignable shrinks as structures diverge
• Output not represented by single matrix
Why HMMs?
• Homologs ranked through scoring
• Accurate profiles from small numbers of sequences
• Accurately combines two alignments having low sequence similarity
Bits saved relative to background
• k = 1..M: HMM node number
• a: amino acid type
• Pk(a): emission probability of a in kth match state
• P0(a): approximation of background probability of a
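The formula itself did not survive on this slide. A minimal sketch, assuming it is the usual average relative entropy over match states, b = (1/M) Σ_k Σ_a Pk(a) log2(Pk(a) / P0(a)), could look like this. The function name, the toy two-letter alphabet, and the formula's exact form are assumptions inferred from the symbols listed above.

```python
from math import log2

def bits_saved(match_emissions, background):
    """Average over match states of sum_a P_k(a) * log2(P_k(a) / P_0(a)).

    ASSUMPTION: this relative-entropy form is inferred from the slide's symbol list.
    match_emissions: list of dicts {amino_acid: P_k(a)}, one per node k = 1..M
    background: dict {amino_acid: P_0(a)}
    """
    total = 0.0
    for p_k in match_emissions:
        for a, p in p_k.items():
            if p > 0.0:  # 0 * log(0) is taken as 0
                total += p * log2(p / background[a])
    return total / len(match_emissions)

# Toy two-letter alphabet, for illustration only
background = {"A": 0.5, "G": 0.5}
conserved = [{"A": 1.0, "G": 0.0}]
uninformative = [{"A": 0.5, "G": 0.5}]
print(bits_saved(conserved, background))      # 1.0
print(bits_saved(uninformative, background))  # 0.0
```

Under this reading, a fully conserved column saves the most bits, while a column matching the background distribution saves none.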
Sequence weights
• Sequences weighted such that b converges on a desired value
• Weights compensate for correlation in sequences
HMM Construction
• Profile HMM constructed from multiple alignment
• Some columns alignable; others not
HMM Construction
• Given an alignment a, a profile HMM is generated
• Each column in a is assigned to an emitter state – transition probabilities are calculated based on observed amino acids
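As a rough illustration of turning alignment columns into per-state probabilities, the sketch below counts the amino acids observed in each column and normalises them into emission distributions. The add-one pseudocount is a simplifying assumption, not the regulariser used in the paper, and the function names are hypothetical. Transition probabilities and insert/delete states are left out.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_emissions(column, pseudocount=1.0):
    """Emission probabilities for one alignment column; gaps ('-') are ignored."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def profile_emissions(alignment, pseudocount=1.0):
    """One emission distribution per column of an equal-length alignment."""
    columns = zip(*alignment)
    return [column_emissions(col, pseudocount) for col in columns]

# Toy alignment of three short fragments (illustrative only)
alignment = ["VHLT", "VHFT", "V-LS"]
profile = profile_emissions(alignment)
print(len(profile))               # one distribution per column
print(round(profile[0]["V"], 3))  # (3 + 1) / (3 + 20)
```

The pseudocount keeps unobserved residues from getting zero probability, which matters when a profile is built from only a few sequences.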
Transition Probabilities
• If we have a total of five match states, the probabilities can be stored in the following table:
HMM Terminology
• π: path through an HMM that produces a sequence s
• P(A|π) = ∏_s P(s|π_s)
• π+: maximum probability path through the HMM
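The maximum probability path is conventionally found with the Viterbi algorithm. The sketch below is a generic version for a plain HMM without silent delete states, so it is simpler than a profile HMM. The two-state toy model and all of its probabilities are made up for illustration.

```python
from math import log

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most probable path.

    All probabilities must be non-zero because the recursion works in log space.
    """
    # best[s] = (log prob of the best path ending in state s, that path)
    best = {
        s: (log(start_p[s]) + log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor r for state s
            score, path = max(
                (best[r][0] + log(trans_p[r][s]), best[r][1]) for r in states
            )
            new_best[s] = (score + log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values())

# Hypothetical two-state model: "W" is typical of conserved positions
states = ("conserved", "variable")
start_p = {"conserved": 0.5, "variable": 0.5}
trans_p = {
    "conserved": {"conserved": 0.8, "variable": 0.2},
    "variable": {"conserved": 0.2, "variable": 0.8},
}
emit_p = {
    "conserved": {"W": 0.9, "x": 0.1},
    "variable": {"W": 0.1, "x": 0.9},
}
log_prob, path = viterbi("WWxx", states, start_p, trans_p, emit_p)
print(path)
```

Working in log space avoids numerical underflow on long sequences, at the cost of requiring non-zero probabilities.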
Aligning Two Alignments
• One alignment is converted to an HMM
• Second alignment is aligned to the HMM
  – Some columns remain alignable
  – Affinities (relative match scores) calculated
• New MSA results
• HMM constructed from new MSA
SATCHMO Algorithm
• Step 1:
  – Create a cluster for each input sequence and construct an HMM from the sequence
• Step 2:
  – Calculate the similarity of all pairs of clusters and identify a pair with highest similarity
  – Align the target and template to produce a new node
SATCHMO Algorithm
• Repeat step 2 until:
  – All sequences are assigned to a single cluster
  – Highest similarity between clusters is below a threshold
  – No alignable positions are predicted
• Output: A set of binary trees
  – Leaf nodes are sequences
  – Each internal node contains an HMM aligning the sequences in its subtree
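The control flow of the two steps and the stopping conditions can be sketched as a greedy merging loop. This is only a skeleton under stated assumptions: `similarity` and `merge` are caller-supplied placeholders standing in for the HMM scoring and alignment-to-HMM steps, and the toy versions below (percent identity, tuple nodes) are hypothetical, not SATCHMO's actual measures.

```python
from itertools import combinations

def build_trees(sequences, similarity, merge, threshold):
    """Greedy pairwise merging; returns the roots that remain (a set of trees)."""
    clusters = [("leaf", seq) for seq in sequences]  # Step 1: one cluster per sequence
    while len(clusters) > 1:                         # stop when a single cluster remains
        # Step 2: find the most similar pair of clusters
        score, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < threshold:                        # stop: best pair is too dissimilar
            break
        merged = merge(clusters[i], clusters[j])
        if merged is None:                           # stand-in for "no alignable positions"
            break
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

def leaves(cluster):
    """Sequences stored at the leaves of a cluster's subtree."""
    if cluster[0] == "leaf":
        return [cluster[1]]
    return leaves(cluster[1]) + leaves(cluster[2])

def toy_similarity(a, b):
    """Mean fraction of identical positions over cross-cluster pairs (equal-length strings)."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(sum(p == q for p, q in zip(x, y)) / len(x) for x, y in pairs) / len(pairs)

def toy_merge(a, b):
    """Placeholder merge: a real node would also hold the HMM for its subtree."""
    return ("node", a, b)

seqs = ["AAAA", "AAAT", "GGGG", "GGGC"]
forest = build_trees(seqs, toy_similarity, toy_merge, threshold=0.5)
print(forest)
```

With these toy inputs the loop stops at the threshold with two separate trees, which shows why the output is a set of binary trees rather than always a single tree.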
Validation Set
• BAliBASE benchmark alignment set used
  – Ref1: equidistant sequences
  – Ref2: distantly related sequences
  – Ref3: subgroups of sequences; < 25% similarity between groups
  – Ref4: alignments with long extensions on the ends
  – Ref5: alignments with long insertions
Comparison of Results
• SATCHMO compared to:
  – ClustalW (Progressive Pairwise Alignment)
  – SAM (HMM)