Post on 25-Dec-2015
Sequence analysis of nucleic acids and proteins: part 2
Based on Chapter 3 of
Post-genome bioinformatics
by Minoru Kanehisa
Oxford University Press, 2000
Prediction of structure and function
Search and learning problems in sequence analysis
(Problems in biological science, with the Math/Stat/CompSci methods applied to them)

Similarity search
• Pairwise sequence alignment
• Database search for similar sequences
• Multiple sequence alignment
• Phylogenetic tree reconstruction
• Protein 3D structure alignment
Methods: optimization algorithms
• Dynamic programming (DP)
• Simulated annealing (SA)
• Genetic algorithms (GA)
• Markov Chain Monte Carlo (MCMC: Metropolis and Gibbs samplers)
• Hopfield neural network

Structure/function prediction
ab initio prediction
• RNA secondary structure prediction
• RNA 3D structure prediction
• Protein 3D structure prediction
Knowledge based prediction
• Motif extraction
• Functional site prediction
• Cellular localization prediction
• Coding region prediction
• Transmembrane domain prediction
• Protein secondary structure prediction
• Protein 3D structure prediction
Methods: pattern recognition and learning algorithms
• Discriminant analysis
• Neural networks
• Support vector machines
• Hidden Markov models (HMM)
• Formal grammar
• CART

Molecular classification
• Superfamily classification
• Ortholog/paralog grouping of genes
• 3D fold classification
Methods: clustering algorithms
• Hierarchical, k-means, etc
• PCA, MDS, etc
• Self-organizing maps, etc
Thermodynamic principle
The amino acid sequence contains all the information necessary to fold a protein molecule into its native 3D state under physiological conditions: a folded protein can be denatured and will then spontaneously refold. This is called Anfinsen’s thermodynamic principle.
Thus it should be possible to predict 3D structure computationally by minimizing a suitable conformational energy function. Such a function is difficult to define and difficult to minimize (globally). This approach is called ab initio prediction.
In practice, structures determined by X-ray crystallography and nuclear magnetic resonance (NMR) are used to give empirical structure-function relationships.
A schematic illustration of RNA secondary structure elements: stem, hairpin loop, bulge loop, internal loop, branch loop and pseudoknot.
RNA secondary structure can be predicted ab initio using an energy function and DP to minimize it, in a process similar to alignment
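The alignment-like recursion can be sketched with a simplified stand-in for the energy function. The block below assumes a Nussinov-style score that simply maximizes the number of nested base pairs (Watson-Crick plus G-U wobble) rather than minimizing a real free energy; the function name `max_base_pairs` and the `min_loop` parameter are illustrative choices, not something taken from the slides.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    Simplifying assumption: every allowed pair scores 1, instead of
    using a thermodynamic energy model. min_loop is the minimum number
    of unpaired bases enclosed by a hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best score for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j is unpaired
            # case 2: base j pairs with some base k, splitting the problem
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAACCC"))
```

As in sequence alignment, the table is filled from short subproblems to long ones, and each cell depends only on cells already computed. A traceback over the same table would recover the paired positions.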
[Figure: secondary structure of yeast alanyl transfer RNA, with paired bases such as G.C, C.G, G.U and A.U forming the stems]
The definition of a dihedral angle and the three backbone dihedral angles, φ, ψ and ω, in a protein. Because ω is around 180°, the backbone configuration can be specified by φ and ψ for each peptide unit.
[Figure: backbone atoms N, Cα and C’ of consecutive residues with their H, O and side-chain (R) substituents; one peptide unit is marked]
Prediction of protein secondary structure: many methods
Prediction of protein secondary structure
The options are α-helix, β-strand and coil.
Many secondary (2º) structure prediction methods exist; those of Chou and Fasman and of Garnier, Osguthorpe and Robson (GOR) are widely used. These are position- and structure-specific scoring matrices (PSSMs) based on modest or large numbers of proteins. On the next page we display the GOR PSSM for α-helices.
These days one can choose from methods based on almost every major machine learning approach: artificial neural networks (ANN), hidden Markov models (HMM), etc.
Helix state (rows: amino acids; columns: window positions -8 to 8)

    -8   -7   -6   -5   -4   -3   -2   -1    0    1    2    3    4    5    6    7    8
G  -16  -18  -18  -29  -41  -51  -67  -85 -105  -64  -42  -37  -30  -33  -26  -21  -17
A   18   20   23   25   32   40   45   45   62   58   51   45   48   43   37   30   32
V    1   -1   -5    0   -2   -9  -10   -5    4   -5   -3   -8  -11   -1    0   -7   -7
L   17   19   22   28   23   29   37   37   51   48   54   59   41   36   34   28   15
I  -21  -19  -15   -5    0    2   10    9   17   12    8   12    6    6   16   18    9
S  -23  -16  -18  -13  -20  -25  -27  -31  -51  -41  -47  -43  -35  -34  -38  -34  -36
T  -13  -21  -16  -16  -14  -11   -7  -14  -28  -30  -33  -30  -20  -17  -18  -12   -8
D   16   20   18   14   23   22   19   26   -1   -5  -26  -35  -21   -6   -3   -1    1
E   19   24   31   35   39   36   36   45   52   40   14  -17  -13  -14  -10   -7   -2
N    2    3   -2   -6   -6   -9  -16  -22  -44  -29  -24  -13    0   -2   -4   -5    3
Q    7    9    6    0    7    0   -3   10   23   35   29   23   16   10    0    0    1
K   25   24   22   18   14   16   16   25   28   37   44   54   49   44   39   44   47
H   14    0   -7   -6  -14   -6   -2    1    2   21   24   25   27   25   19   25   31
R    1   -5  -19  -25  -16  -16   -7   -4   -1   -1    3    6    0    0   -6    8    0
F    0    7   17   23   23   18   29   26   32   40   34   28   12    3   15    6    4
Y   -8   -9  -10  -18  -13  -13  -31  -26  -15  -24  -18  -23  -28  -19  -16  -18  -23
W    8   18   11    9    2   26   37   29   30   17   -1   12   13   11   31   13    2
C  -77  -71  -74  -74  -67  -60  -71  -61  -47  -46  -56  -58  -67  -70  -71  -80  -81
M    2  -12   -9   -1    0   21   33   25   34   41   39   44   29   15    4   -2  -11
P    0   -6   -7   -6  -15  -22  -35  -47  -68 -179  -95  -72  -53  -37  -28  -22  -11
X    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
Cter Nter
Two architectures of the hierarchical neural network: (a) the perceptron, with an input layer and an output layer, and (b) the back-propagation neural network, with an input layer, a hidden layer and an output layer.
Prediction of transmembrane domains
Membrane proteins are very common, perhaps 25% of all proteins. The membrane interior is hydrophobic, so a transmembrane domain typically consists of hydrophobic residues, with about 20 needed to span the membrane.
There are a number of rules for detecting them: Kyte-Doolittle hydropathy scores work fairly well, and the Klein-Kanehisa-DeLisi discriminant function does even better.
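A hydropathy scan of this kind can be sketched in a few lines. The block below uses the published Kyte-Doolittle scale and averages it over a sliding window; the window of 19 residues and the cutoff of 1.6 are commonly quoted settings and are assumptions here, as are the function names. The Klein-Kanehisa-DeLisi discriminant function is not implemented.

```python
# Kyte-Doolittle hydropathy values for the 20 amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices (0-based) of windows whose average exceeds the cutoff.

    The threshold is an assumed, commonly used value, not a rule
    taken from the slides.
    """
    profile = hydropathy_profile(seq, window)
    return [i for i, score in enumerate(profile) if score > threshold]


# Toy sequence: a run of 19 leucines flanked by charged residues.
toy = "D" * 10 + "L" * 19 + "K" * 10
print(candidate_tm_windows(toy))
```

Overlapping windows above the cutoff would normally be merged into a single predicted segment; that step is left out to keep the sketch short.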
Three-dimensional structures of two membrane proteins:
• Photosynthetic reaction centre (PDB: 1PRC)
• Outer membrane protein: porin (PDB: 1OMF)
Hidden Markov Models (HMMs)
S = States {s0, s1, …, sn}
V = Output alphabet {v0, v1, …, vm}
A = {aij} = probability of a transition from si to sj
B = {bi(j)} = probability of outputting vj in state si
• What is the probability of a sequence of observations?
• What are the maximum likelihood estimates of parameters in an HMM?
• What is the most likely sequence of states that produced a given sequence of observations?
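The first question is usually answered with the forward algorithm, sketched below under a few assumptions: the model has no explicit end state, and an initial state distribution `start` is supplied (the slide defines only A and B, so `start` and all names in the code are illustrative). The second question is typically handled by Baum-Welch (EM) and the third by the Viterbi algorithm, neither of which is shown here.

```python
def forward_probability(obs, start, trans, emit):
    """P(observations) summed over all state paths (forward algorithm).

    start[s]    : assumed initial probability of state s
    trans[s][t] : transition probability a_st
    emit[s][v]  : probability b_s(v) of emitting symbol v in state s
    """
    if not obs:
        return 1.0
    states = list(start)
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            t: emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in states)
            for t in states
        }
    return sum(alpha.values())


# Toy two-state model with made-up numbers.
start = {"x": 0.6, "y": 0.4}
trans = {"x": {"x": 0.7, "y": 0.3}, "y": {"x": 0.4, "y": 0.6}}
emit = {"x": {"a": 0.9, "b": 0.1}, "y": {"a": 0.2, "b": 0.8}}
print(forward_probability("ab", start, trans, emit))
```

The recursion costs time proportional to the sequence length times the square of the number of states, instead of enumerating every state path.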
A hidden Markov model for sequence analysis
[Figure: states running from Start to End: match states m0 to m5, insert states I0 to I4 and delete states d1 to d4]
m = match state (output), I = insert state (output), d = delete state (no output)
Prediction of protein 3D structures
Knowledge based prediction of protein 3D (tertiary) structure can be classified into two categories: comparative modelling and fold recognition. Comparative modelling can work well when there is significant sequence similarity to a protein with known 3D structure. By contrast, fold recognition is used when no significant sequence similarity exists, and it makes use of the knowledge and analysis of all protein structures. One such method, due to Eisenberg and colleagues, involves 3D-1D alignment. Another is threading.
The 3D-1D method for prediction of protein 3D structures involves the construction of a library of 3D profiles for the known protein structures.

[Figure labels: Main chain; Side chain; Inside or outside; Polar or apolar; environmental classes E, P1, P2, B1, B2, B3]

3D-1D score (rows: amino acids; columns: environmental class)

       B1     B1     B1   . . . .
A    -0.66  -0.79  -0.91  . . . .
R    -1.67  -1.16  -2.16  . . . .
.      .      .      .
Y     0.18   0.07   0.17  . . . .
W     1.00   1.17   1.05  . . . .

3D profile (rows: amino acids; columns: residue number)

        1      2      3   . . . .   N
A      12    -66     46   . . . .
R     -32    -80    -34   . . . .
.       .      .      .
Y     -94    112   -210   . . . .
W    -214    102   -135   . . . .
DNA - - - - agacgagataaatcgattacagtca - - - -
Transcription
RNA - - - - agacgagauaaaucgauuacaguca - - - -
Translation
Protein - - - - - DEI - - - -
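The two steps in this diagram can be reproduced with the standard genetic code. The sketch below is illustrative only: the function names and the 0-based `frame` argument are assumptions, and the codon table is built from the usual TCAG ordering of the standard code.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to RNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(rna, frame=0):
    """Translate RNA codon by codon, starting at a 0-based offset.

    Stop codons are written as '*'; translation does not halt at them.
    """
    dna = rna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


dna = "agacgagataaatcgattacagtca"
rna = transcribe(dna)
print(rna)
print(translate(rna, frame=1))
```

With an offset of one base the translation begins with D, E and I, matching the protein fragment shown on the slide.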
Protein Folding Problem
Exon Intron Exon Intron Exon
Protein
Splicing
Gene Structure I
Gene Structure II
DNA (5’ to 3’): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4
TRANSCRIPTION
pre-mRNA
SPLICING
mRNA: AUG - X1…Xn - STOP
TRANSLATION
protein sequence
protein 3D structure
Gene Structure III
DNA (5’ to 3’): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4
Sequence signals:
• Promoter: TATA
• Translation initiation: ATG
• Splice site: GGTGAG
• Branchpoint: CTGAC
• Pyrimidine tract
• Splice site: CAG
• Stop codon: TAG/TGA/TAA
• polyA signal
Additional Difficulties
• Alternative splicing
• Pseudo genes
[Figure: DNA is transcribed to pre-mRNA; SPLICING and ALTERNATIVE SPLICING give different mRNAs, and TRANSLATION of these gives Protein I and Protein II]
Approaches to Gene Recognition
• Homology: BLASTN, TBLASTX, Procrustes
• Statistical de novo: GRAIL, FGENEH, Genscan, Genie, Glimmer
• Hybrid: GenomeScan, Genie
F(*,*,*,…)
Example: Glimmer
Gene Finding in Microbial DNA
• No introns
• 90% coding
• Shorter genomes (less than 10 million bp)
• Lots of data
Gene Structure in Prokaryotes
[Figure: an open reading frame (ORF) running from the translation initiation codon ATG to a stop codon TAG/TGA/TAA]
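Because prokaryotic genes have no introns, a first pass at gene finding can simply scan for open reading frames. The sketch below is a naive ORF scanner and is not how Glimmer works (Glimmer scores candidates with Markov models); it also makes the simplifying assumption of scanning only the three forward-strand frames. The function name and `min_codons` parameter are illustrative.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) pairs, 0-based and end-exclusive.

    Each ORF runs from the first ATG in a frame to the next in-frame
    stop codon, stop included. Only the forward strand is scanned,
    which is a simplification.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # number of codons before the stop, ATG included
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


print(find_orfs("CCATGAAATGA"))
```

In practice `min_codons` would be set much higher, since short ORFs occur frequently by chance, and the reverse complement would be scanned as well.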
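Simplest Hidden Markov Gene Model
[Figure: a model with the labels Intergene, ATG, TAA and Coding. Values shown in the figure:
emission probabilities A 0.25, C 0.25, G 0.25, T 0.25
emission probabilities A 0.9, C 0.03, G 0.04, T 0.03
transition probabilities 1, 1, 0.9, 0.1, 0.1, 0.9]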
The Viterbi Algorithm
A A C A G T G A C T C T
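The Viterbi algorithm finds the single most probable state path for a sequence such as the one above. The sketch below reduces the gene model to two states and makes several assumptions that the garbled figure does not settle: the uniform emissions are assigned to the intergene state and the skewed ones to the coding state, each state stays put with probability 0.9, the ATG and TAA states are omitted, and both states are equally likely at the start.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path, computed in log space."""
    v = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        row, ptr = {}, {}
        for t in states:
            # best predecessor of state t at this position
            prev = max(states, key=lambda s: v[-1][s] + math.log(trans[s][t]))
            row[t] = (v[-1][prev] + math.log(trans[prev][t])
                      + math.log(emit[t][symbol]))
            ptr[t] = prev
        v.append(row)
        back.append(ptr)
    # trace back from the best final state
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Assumed two-state simplification of the slide's model.
STATES = ["intergene", "coding"]
START = {"intergene": 0.5, "coding": 0.5}
TRANS = {"intergene": {"intergene": 0.9, "coding": 0.1},
         "coding": {"coding": 0.9, "intergene": 0.1}}
EMIT = {"intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "coding": {"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03}}

print(viterbi("AACAGTGACTCT", STATES, START, TRANS, EMIT))
```

Working with log probabilities avoids numerical underflow on long sequences, which matters for genome-scale input.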
Example: Genscan
Gene Finding in Human DNA
• Introns
• 5% coding
• Large genome (3 billion bp)
• Alternative splicing
The Genscan HMM
Examples of functional sites.
(Columns: Molecule, Processing, Functional sites, Interacting molecules)

Molecule: DNA
Processing: Replication; Transcription
Functional sites: Replication origin; Promoter; Enhancer; Operator and other prokaryotic regulators
Interacting molecules: Origin recognition complex; RNA polymerase; Transcription factor; Repressor, etc.

Molecule: RNA
Processing: Post-transcriptional processing; Translation
Functional sites: Splice site; Translation initiation site
Interacting molecules: Spliceosome; Ribosome

Molecule: Protein
Processing: Post-translational processing; Protein sorting; Protein function
Functional sites: Cleavage site; Phosphorylation and other modification sites; ATP binding sites; Signal sequence, localization signals; DNA binding sites; Ligand binding sites; Catalytic sites
Interacting molecules: Protease; Protein kinase, etc.; Signal recognition particle; DNA; Ligands; Many different molecules
Protein sorting prediction
The final step in informational expression of proteins involves their sorting to the appropriate location within or outside the cell. The information for correct localization is usually located within the protein itself.
Sequence Alignment Problem
• Task: find common patterns shared by multiple protein sequences
• Importance: understanding function and structures; revealing evolutionary relationship, data organizing …
• Types: Pairwise vs. Multiple; Global vs. Local.
• Approaches: criteria-based (extension of pairwise methods) versus model-based (EM, Gibbs, HMM)
Outline of Liu-Lawrence approach
• Local alignment --- Examples, the Gibbs sampling algorithm
• A simple multinomial model for block-motifs and the Bayesian missing-data formulation.
Possible but not covered here:
• Motif sampler: repeated motifs.
• The hidden Markov model (its decoupling)
• The propagation model and beyond
Example: search for regulatory binding sites
• Gene Transcription and Regulation
– Transcription is initiated by RNA polymerase binding at the so-called promoter region (TATA-box; or -10, -35)
– Regulated by some (regulatory) proteins on DNA “near” the promoter region.
– These binding sites on DNA are often “similar” in composition.
[Figure: a gene drawn 5’ to 3’, showing enhancers and repressors, the promoter region with RNA polymerase, and the translation start at the starting codon AUG]
The particular dataset
• 18 DNA segments, each 105 bp long.
• There is at least one experimentally known CRP binding site in each sequence.
• The binding sites are about 16-19 base pairs long, with considerable variability in their contents.
• We are interested in seeing whether we can find these sites computationally.
The Data Set
Truth?
Example: H-T-H proteins
• HTH: sequence-specific DNA binding, gene regulation.
• Motifs occur as local isolated structures. The whole 3-D structures are known and very different.
• 30 sequences with known HTH positions were chosen. The set represents a typically diverse cross section of HTH sequences.
• Width of the motif pattern is assumed to be in the range from 17 to 22. The criterion “information per parameter” (IPP) is used to determine the optimal width, 21.
• A heuristic convergence criterion was developed (multiple restarts with IPP monitored).
Finding Local Alignment of Multiple Sequences
[Figure: a set of sequences, the k-th of length nk, each containing a motif of width w at start positions a1, a2, …, ak]
Alignment variable: A={a1, a2, …, ak}
Objective: find the “best” common patterns.
Motif Alignment Model
[Figure: sequences of length nk, each containing a motif of width w at start positions a1, a2, …, ak]
The missing data: Alignment variable: A={a1, a2, …, ak}
• Every non-site position follows a common multinomial with p0=(p0,1 ,…, p0,20)
• Every position i in the motif element follows probability distribution pi=(pi,1 ,…, pi,20)
The Tricky Part: The alignment variable A={a1, a2, …, ak} is not observable
• General Missing Data problem:
– Unobserved data in each datum
– Object of the DP optimization (path)
– Potentially observable
– Examples: alignment, RNA structure, protein secondary structure
Statistical Models
• How do we describe patterns? – frequencies of amino acid types.
– multinomial distribution --- more generally a “model”
A typical aligned motif

Motif positions   1 2 3 4 5 6
Seq 1             I G K P I E
Seq 2             V G D P G E
Seq 3             V G D D A D
Seq 4             I G Q H P E
Seq 5             L S G P E E

A total of k sequences.
Multinomial distribution: model Mi for the i-th column:
(ki,1, ki,2, …, ki,20) ~ Multinom (k, pi ) where pi=(pi,1 ,…, pi,20)
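Estimation for the “pattern”
• The maximum likelihood estimate: $\hat{p}_{ij} = k_{ij} / k$, where $k_{i,1} + \cdots + k_{i,20} = k$
• Bayesian estimate:
– Prior: $p_i \sim \mathrm{Dirichlet}(\alpha_{i,1}, \ldots, \alpha_{i,20})$, “pseudo-counts”
– Posterior distribution: $[p_i \mid \mathrm{obs}] \sim \mathrm{Dirichlet}(\alpha_{i,1} + k_{i,1}, \ldots, \alpha_{i,20} + k_{i,20})$
– Posterior mean: $\hat{p}_{ij} = (k_{ij} + \alpha_{ij}) / (k + \sum_{l} \alpha_{il})$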
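Both estimates can be computed directly from one column of the aligned motif shown earlier. The sketch below assumes a symmetric Dirichlet prior, meaning the same pseudo-count for every amino acid type; that choice, and the names used, are illustrative rather than taken from the slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_estimates(column, pseudo=1.0):
    """Maximum likelihood and posterior-mean estimates for one column.

    pseudo is an assumed symmetric Dirichlet pseudo-count alpha,
    identical for all 20 amino acid types.
    """
    counts = Counter(column)
    k = len(column)
    mle = {a: counts[a] / k for a in AMINO_ACIDS}
    total = k + pseudo * len(AMINO_ACIDS)
    posterior_mean = {a: (counts[a] + pseudo) / total for a in AMINO_ACIDS}
    return mle, posterior_mean


motif = ["IGKPIE", "VGDPGE", "VGDDAD", "IGQHPE", "LSGPEE"]
second_column = "".join(seq[1] for seq in motif)  # "GGGGS"
mle, post = column_estimates(second_column)
print(mle["G"], post["G"], mle["A"], post["A"])
```

The maximum likelihood estimate gives probability zero to every residue not seen in the column, while the posterior mean keeps all 20 probabilities positive. This matters when the estimates are later used as ratios or multiplied together.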
Dealing with the missing data
• Let θ=(p0 , p1 , … , pw ), “parameter”, A={a1, a2, …, aK}
• Iterative sampling: P(θ | A, Data); P(A | θ, Data)
Draw from [θ | A, Data], then draw from [A | θ, Data]
• Predictive Updating: pretend that K-1 sequences have been aligned. We stochastically predict for the K-th sequence!
[Figure: sequences with aligned sites a1, a2, a3, and the site ak of the remaining sequence marked “?”]
The Algorithm
• Initialize by choosing random starting positions $a_1^{(0)}, a_2^{(0)}, \ldots, a_K^{(0)}$
• Iterate the following steps many times:
– Randomly or systematically choose a sequence, say, sequence k, to exclude.
– Carry out the predictive-updating step to update ak
• Stop when not much change is observed, or some criterion is met.

The PU-Step
[Figure: sequences with aligned sites a1, a2, a3, and the site ak of the excluded sequence marked “?”]
1. Compute predictive frequencies of each position i in motif
cij = count of amino acid type j at position i.
c0j = count of amino acid type j in all non-site positions.
qij = (cij+bj)/(K-1+B), B = b1 + … + bK “pseudo-counts”
2. Sample from the predictive distribution of ak:
$P(a_k = l+1) \propto \prod_{i=1}^{w} \frac{q_{i,R_k(l+i)}}{q_{0,R_k(l+i)}}$
where $R_k(l+i)$ is the residue at position l+i of sequence k.
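The whole procedure (random initialization, then repeated predictive-updating steps) can be sketched as follows. This is a minimal site sampler under stated assumptions, not the authors' implementation: it assumes exactly one site per sequence, one shared pseudo-count `pseudo` for every residue type, a background frequency q0 estimated from the non-site positions of the other K-1 sequences, a fixed number of sweeps in place of a convergence check, and 0-based start positions. The default alphabet is DNA, to match the CRP example; all names are illustrative.

```python
import random


def gibbs_site_sampler(seqs, w, alphabet="ACGT", pseudo=1.0, sweeps=200, seed=0):
    """Sample one motif start position (0-based) per sequence."""
    rng = random.Random(seed)
    K = len(seqs)
    B = pseudo * len(alphabet)  # total pseudo-count
    # random starting positions
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(sweeps):
        for k in range(K):  # systematically exclude each sequence in turn
            counts = [{a: 0 for a in alphabet} for _ in range(w)]
            background = {a: 0 for a in alphabet}
            for j in range(K):
                if j == k:
                    continue
                for t, ch in enumerate(seqs[j]):
                    if pos[j] <= t < pos[j] + w:
                        counts[t - pos[j]][ch] += 1   # c_ij
                    else:
                        background[ch] += 1           # c_0j
            bg_total = sum(background.values())
            q = [{a: (c[a] + pseudo) / (K - 1 + B) for a in alphabet}
                 for c in counts]
            q0 = {a: (background[a] + pseudo) / (bg_total + B)
                  for a in alphabet}
            # predictive weight of every possible start in sequence k
            s = seqs[k]
            weights = []
            for l in range(len(s) - w + 1):
                weight = 1.0
                for i in range(w):
                    weight *= q[i][s[l + i]] / q0[s[l + i]]
                weights.append(weight)
            pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos


# Toy data: the word TTGACA planted in otherwise arbitrary sequences.
seqs = ["ACGTACTTGACAGGCATCGA",
        "TTGACACCGTAGCTAGGCTA",
        "GGCATCGATCGTTGACATCA",
        "CATGCTTGACAAGCTAGCTA"]
print(gibbs_site_sampler(seqs, w=6))
```

Because the sampler is stochastic, different seeds can give different alignments, and a run can settle on a shifted version of the motif, which is the problem addressed on the next slide.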
Phase-shift and Fragmentation
• Sometimes get stuck in a local shift optimum
• How to “escape” from this local optimum?
– Simultaneous move: A → A+δ = (a1+δ, … , aK+δ)
– Use a Metropolis step: accept the move with prob = p, where $p = \min\{1, \pi(A+\delta \mid R) / \pi(A \mid R)\}$
Compare entropies between new columns and left-out ones.
[Figure: sampled positions ak (marked “?”) shown against the true motif locations]
Acknowledgements for slides used
PDB: protein figures
Lior Pachter: gene finding
Jun Liu: Gibbs sampler