Carnegie Mellon School of Computer Science, Biological Language Modeling Project
Copyright © 2003, Carnegie Mellon. All Rights Reserved.
TXTpred: A New Method for Protein Secondary Structure Prediction
Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman
School of Computer Science, Carnegie Mellon University
May 14, 2003
Roadmap
• Overview of secondary structure prediction
• Description of the TXTpred method
• Experimental results and analysis
• Discussion and further work
Secondary Structure of a Protein Sequence
• The Dictionary of Protein Secondary Structure (DSSP) annotates each residue with its structure
  – Based on hydrogen-bonding patterns and geometrical constraints
• 7 DSSP labels for PSS:
  – Helix types: H, G (alpha-helix, 3/10 helix)
  – Sheet types: B, E (isolated beta-bridge, strand)
  – Coil types: T, _, S (coil)
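The reduction from the 7 DSSP labels to a 3-state H/E/C alphabet, as grouped on this slide, can be sketched as a fixed mapping (the function and dictionary names are ours):

```python
# Collapsing the 7 DSSP labels into the 3-state alphabet
# (H = helix, E = sheet, C = coil), following the grouping on this slide.
DSSP_TO_3STATE = {
    "H": "H", "G": "H",            # alpha-helix, 3/10 helix
    "B": "E", "E": "E",            # isolated beta-bridge, strand
    "T": "C", "_": "C", "S": "C",  # coil types
}

def to_three_state(dssp_labels: str) -> str:
    """Map a DSSP label string to the 3-state H/E/C alphabet."""
    return "".join(DSSP_TO_3STATE[x] for x in dssp_labels)

print(to_three_state("HGEB_TS"))  # -> HHEECCC
```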
Secondary Structure of a Protein Sequence
• Accuracy Limit ~ 88%
Task Definition
• Given a protein sequence:
  – APAFSVSPASGA
• Predict its secondary structure sequence:
  – CCEEEEECCCCC
• Focus on soluble proteins, not on membrane proteins
Overview of Previous Work -1
• 1st-generation methods
  – Calculate propensities for each amino acid
  – E.g. the Chou-Fasman method (Chou & Fasman, 1974)
• 2nd-generation methods
  – The "window" concept: APAFSVSPAS (window size = 7)
  – Calculate propensities for segments of 3-51 amino acids
  – E.g. the GOR method (Garnier et al., 1978)
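The propensity idea behind the 1st-generation methods can be sketched by simple counting; this illustrates the general scheme P(aa | state) / P(aa) rather than the exact Chou-Fasman procedure, and the toy sequences and labels are hypothetical:

```python
from collections import Counter

def propensities(sequences, labels, state="H"):
    """Chou-Fasman-style propensity P(aa | state) / P(aa),
    estimated by counting over a labelled dataset."""
    aa_total, aa_state = Counter(), Counter()
    for seq, lab in zip(sequences, labels):
        for aa, s in zip(seq, lab):
            aa_total[aa] += 1
            if s == state:
                aa_state[aa] += 1
    n_total = sum(aa_total.values())
    n_state = sum(aa_state.values())
    if n_state == 0:
        return {}
    return {aa: (aa_state[aa] / n_state) / (aa_total[aa] / n_total)
            for aa in aa_total}

# Toy example (hypothetical sequences and labels, not data from the paper):
print(propensities(["APAFSV", "AAEEKL"], ["CHHHCC", "HHHHCC"], "H"))
```

A propensity above 1 means the amino acid occurs in that state more often than chance.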
Overview of Previous Work -2
• 3rd-generation methods
  – Use evolutionary information from multiple sequence alignment (p-value cut-off = 10^-2)
  – PHD: neural network, sequence features only (Rost & Sander, 1993)
  – DSC: LDA with biological features (GOR, hydrophobicity, etc.) (King & Sternberg, 1996)
• Later refinements
  – Apply divergent sequence alignments: e.g. PROF (Ouali & King, 2000)
  – Combine results of different systems: e.g. Jpred (Cuff & Barton, 1999)
  – Bayesian segmentation (Schmidler et al., 1999)
Summary of Performance
Method         Performance (Q3)
Chou-Fasman    ~50%
GOR            ~56%
PHD            ~71%
DSC            ~70%
Disadvantages of Previous Work
• Most are "black box" predictors with little biological interpretability
• Little focus on long-range interactions; mostly local information only
• Performance appears asymptotically bounded
Roadmap
• Overview of secondary structure prediction
• Description of the TXTpred method
• Experimental results and analysis
• Discussion and further work
TXTpred
• Basic idea:
  – Build a meaningful biological vocabulary
  – Apply language techniques for prediction
• Major challenge: how to build the vocabulary?
  – Context-free N-grams of amino acids inside the window
  – Sequence: APAFSVSPAS (window = 7)
  – N-grams: P, A, ..., P, PA, AF, ..., SP, PAF, AFS, ..., VSP
Biological Vocabulary
• Context-sensitive vocabulary
  – Analogy: the same word can have different meanings (e.g. "bank"); likewise, the same amino acid can have different properties: APAFSVSPAS
• Encode context semantics into the N-gram
  – Record position information in the N-gram
  – Example: APAFSVSPAS (window size = 7)
  – Words: P-3, A-2, F-1, S+0, V+1, S+2, P+3
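A sketch of generating such position-annotated words, with offsets running from -3 to +3 relative to the centre residue of the window. The function name and the offset convention for n-grams longer than one (anchoring at the n-gram's first residue) are our assumptions; the slide only shows one-gram offsets:

```python
def position_words(window: str, n_max: int = 2):
    """Position-annotated n-grams for a window whose centre
    residue sits at offset 0 (context-sensitive vocabulary)."""
    center = len(window) // 2
    words = []
    for n in range(1, n_max + 1):
        for i in range(len(window) - n + 1):
            offset = i - center  # position of the n-gram's first residue
            words.append(f"{window[i:i+n]}{offset:+d}")
    return words

# One-grams for the size-7 window of APAFSVSPAS centred on S:
print(position_words("PAFSVSP", n_max=1))
# -> ['P-3', 'A-2', 'F-1', 'S+0', 'V+1', 'S+2', 'P+3']
```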
Text Classification
• Analogy
  – The topic of a document is expressed by the words of the document
  – The structure of a residue can likewise be inferred from the biological words nearby
  – Text classification achieves high accuracy
• Text classification techniques
  – Documents to vectors, with TF-IDF weighting:
    w(word) = [1 + log(word frequency)] x log(N / document frequency)
  – Classifier: Support Vector Machines
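The document-to-vector weighting on this slide is standard TF-IDF; a minimal sketch (the function name is ours):

```python
import math

def tfidf_weight(word_freq: int, doc_freq: int, n_docs: int) -> float:
    """TF-IDF weight as on the slide:
    (1 + log word_freq) * log(n_docs / doc_freq)."""
    if word_freq == 0 or doc_freq == 0:
        return 0.0
    return (1 + math.log(word_freq)) * math.log(n_docs / doc_freq)

# A word appearing 3 times, present in 10 of 1000 windows:
print(tfidf_weight(3, 10, 1000))
```

Words that appear in every window get weight 0, since log(N/df) vanishes; the weighted vectors are then fed to the SVM classifier.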
TXTpred Method
Settings:
• Window size = 17
• One-grams and two-grams
• Number of features = 3000
Evaluation Measure• Q3 (accuracy)
• Precision, Recall
• Segment Overlap quantity (SOV)
• Matthews correlation coefficient

  Q3 = (correctly predicted residues) / N = sum_i p_i / N, for i in {H, E, C}
  Precision (per state): Q_pre = p / (p + o)
  Recall (per state):    Q_rec = p / (p + u)
  SOV = (100 / N) sum_(S1,S2) [ (minov(S1,S2) + delta(S1,S2)) / maxov(S1,S2) ] x len(S1)
  Matthews correlation (per state): C_i = (p_i n_i - o_i u_i) / sqrt((p_i + o_i)(p_i + u_i)(n_i + o_i)(n_i + u_i))

  where p, n, o, u denote true positives, true negatives, false positives, and false negatives.
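The Q3 and per-state Matthews correlation measures can be sketched directly from their definitions (a simplified illustration, not the paper's evaluation code):

```python
import math

def q3(pred: str, true: str) -> float:
    """Q3: fraction of residues whose 3-state label is predicted correctly."""
    assert len(pred) == len(true)
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def mcc(pred: str, true: str, state: str) -> float:
    """Matthews correlation for one state, with p/n/o/u =
    true pos / true neg / false pos / false neg."""
    p = sum(pr == state and tr == state for pr, tr in zip(pred, true))
    n = sum(pr != state and tr != state for pr, tr in zip(pred, true))
    o = sum(pr == state and tr != state for pr, tr in zip(pred, true))
    u = sum(pr != state and tr == state for pr, tr in zip(pred, true))
    denom = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return (p * n - o * u) / denom if denom else 0.0

print(q3("CCEEEEECCCCC", "CCEEEECCCCCC"))  # 11 of 12 residues correct
```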
Experimental Results
• RS126 dataset
• CB513 dataset
Biological Language Properties
[Figure: term frequency as a function of rank, for one-grams and two-grams. Does the biological vocabulary follow a power law?]
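A rank-frequency check of this kind can be sketched as follows (the sequences are toy examples); under a power law, log frequency falls roughly linearly in log rank:

```python
from collections import Counter

def rank_frequency(sequences, n=1):
    """Count n-gram 'words' over the sequences and return their
    frequencies sorted in rank order (most frequent first)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return sorted(counts.values(), reverse=True)

freqs = rank_frequency(["APAFSVSPAS", "AAEEKLMMKA"], n=2)
print(freqs)  # plot log(frequency) vs log(rank) to test for a power law
```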
Sequence Analysis - 1: Feature Selection
• Top ten discriminating features for helix
• Verification by Chou-Fasman parameters
  – Helix favors A, E, M, L, K (top 5 amino acids)
  – Helix disfavors P (top 1 amino acid)
Sequence Analysis - 1: Feature Selection
• Top ten discriminating features for sheet
• Verification by Chou-Fasman parameters
  – Sheet favors V, I, Y, F, W (top 5 amino acids)
  – Sheet disfavors D, E (top 2 amino acids)
Sequence Analysis - 1: Feature Selection
• Top ten discriminating features for coil
• Verification by Chou-Fasman parameters
  – Coil favors N, P, G, D, S (top 5 amino acids)
  – Coil disfavors V, I, L (top 3 amino acids)
Sequence Analysis - 2: Word Correlation
• Some words have strong correlations and co-occur frequently
• Technique: Singular Value Decomposition
• Examples from text
  – Phrases: {president, Bush}
  – Semantically correlated: {Olympic, sports}
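A toy illustration of using SVD to expose co-occurring words: rows are windows ("documents"), columns are vocabulary words, and the low-rank projection places words that co-occur close together. The 4x4 count matrix is hypothetical:

```python
import numpy as np

# Hypothetical window-by-word count matrix: words 0 and 1 always
# co-occur, as do words 2 and 3, but the two pairs never mix.
X = np.array([[2., 2., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 3., 2.],
              [0., 0., 1., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
word_vecs = Vt[:2].T  # each word as a 2-d vector in the top singular space

def cos(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(word_vecs[0], word_vecs[1]))  # co-occurring pair: near 1
print(cos(word_vecs[0], word_vecs[2]))  # non-co-occurring pair: near 0
```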
Sequence Analysis - 2: Word Correlation
• Top ten correlated word pairs
Sequence Analysis - 2: Word Correlation

  Regular expression: CPXXAI
    Protein sequences:    Sq1: ECPNEAIM   Sq2: ECPAEAIK   Sq3: GCPIPAIL
    Secondary structure:  L1: HCCCCCEC    L2: HCCCCCEE    L3: CCCCCEEE
    Conjecture: coil connected to sheet

  Regular expression: PGH
    Protein sequences:    Sq1: TFPGHSA    Sq2: DCPGHAD
    Secondary structure:  L1: CCCCCCC     L2: ECCCHHH
    Conjecture: coil

  Regular expression: EEL
    Protein sequences:    Sq1: DDEELLE    Sq2: WSEELNS
    Secondary structure:  L1: CCHHHHH     L2: CCHHHHH
    Conjecture: helix
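The patterns in the table map directly onto regular expressions, with X standing for any amino acid; a minimal sketch using the CPXXAI example:

```python
import re

# The slide's CPXXAI pattern as a regular expression,
# where each X (".") matches any amino acid.
motif = re.compile("CP..AI")

for seq in ["ECPNEAIM", "ECPAEAIK", "GCPIPAIL"]:
    m = motif.search(seq)
    print(seq, "->", m.group() if m else "no match")
```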
Conclusion
• TXTpred summary
  – Context-sensitive biological vocabulary
  – Novel application of text classification to secondary structure prediction
  – Performance comparable to existing secondary structure predictors
  – Analysis provides reasonable biological meanings and structure indicators
Future Work
• Deeper study of extracting a more meaningful biological vocabulary
• Further discovery of new features, such as torsion angles and free energy
• Advanced learning models to consider long-range interactions
  – Conditional random fields, maximum-entropy Markov models
Acknowledgements
• Vanathi Gopalakrishnan, University of Pittsburgh
• Ivet Bahar, University of Pittsburgh
Motivation for 2-D (Secondary Structure) Prediction
• Basis for three-dimensional structure prediction
• Improves other sequence and structure analyses
  – Sequence alignment
  – Threading and homology modeling
  – Experimental data
  – Protein design