probabilistic suffix trees maria cutumisu cmput 606 october 13, 2004

Probabilistic Suffix Trees

Maria Cutumisu

CMPUT 606

October 13, 2004

Goal

Provide efficient prediction for protein families

Probabilistic Suffix Trees (PSTs) are variable length Markov models (VMMs)

Conceptual Map

[Diagram: Probabilistic Suffix Trees, Suffix Trees, Variable Length Markov Model]

Background

PSTs were introduced by Ron, Singer, Tishby

Bejerano, Yona made further improvements (bPST)

Poulin: efficient PSTs (ePSTs)

PSTs are also known as prediction suffix trees

Higher Order Markov Models

A k-order Markov chain: history of length k for conditional probabilities

Exponential storage requirements

As the order of the chain increases, the amount of training data needed for accurate estimation also increases
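The exponential growth can be made concrete by counting free parameters. This is a minimal sketch, assuming a 20-letter amino-acid alphabet (a figure not stated on the slides); the function name is hypothetical.

```python
def markov_parameter_count(alphabet_size: int, order: int) -> int:
    """Free parameters of a fixed-order Markov chain.

    Each of the alphabet_size**order possible histories needs its own
    distribution over the next symbol, which has alphabet_size - 1
    free parameters because the probabilities sum to 1.
    """
    return alphabet_size ** order * (alphabet_size - 1)


# 20 is the assumed amino-acid alphabet size (not from the slides).
for k in range(1, 6):
    print(k, markov_parameter_count(20, k))
```

Under that assumption an order-2 chain already needs 7,600 parameters, and every further increase in order multiplies the count by 20.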

Variable Length Markov Models (VMMs)

Space and parameter-estimation efficient

The length of the history sequence used for prediction is variable

Only needed parameters are stored

Created from less training data

[Figure: a test sequence, >T1 AHGSGYMNAB, shown next to the training sequences]

Is T1 in the training set?

VMMs

P(sequence) = product of the probabilities of each amino acid given those that precede it

Conditional probability based on the context of each amino acid

A context function k(·) can select the history length based on the context x1 ... xi−1 xi

VMMs were first introduced as PSTs
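Written out, the product above is the chain rule, and a VMM shortens each conditioning history. In this sketch the subscripted length k_i is an assumed shorthand for the value the context function k(·) selects at position i; that notation is not from the slides.

```latex
P(x_1 \dots x_n)
  = \prod_{i=1}^{n} P(x_i \mid x_1 \dots x_{i-1})
  \approx \prod_{i=1}^{n} P(x_i \mid x_{i-k_i} \dots x_{i-1})
```

A fixed-order chain is the special case in which k_i is the same constant k at every position.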

PSTs

VMMs for efficient prediction

Pruned during training to contain only required parameters

bPST: represents histories

ePST: represents sequences

bPST

Used to represent the histories for prediction instead of the training sequences

The possible histories are the reversed strings of all the substrings of the training sequences
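The set of possible histories can be enumerated directly. This is a naive sketch rather than the construction used in the cited papers; the function name and the `max_len` cap (standing in for a length limit on the tree) are assumptions made for illustration.

```python
def candidate_histories(training, max_len):
    """Reversed substrings of length 1..max_len of every training sequence."""
    histories = set()
    for seq in training:
        for start in range(len(seq)):
            # Cap the substring length at max_len and at the sequence end.
            last_end = min(start + max_len, len(seq))
            for end in range(start + 1, last_end + 1):
                histories.add(seq[start:end][::-1])
    return histories


# Toy input: the first three residues of the example sequence AHGSGYMNAB.
print(sorted(candidate_histories(["AHG"], 2), key=len))
```

Reversal puts the most recent residue first, so a history can be matched against the tree starting at the root.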

Prediction with bPSTs

The conditional probabilities P(xi | xi−1 …) are obtained for each position by tracing a path from the root that matches the preceding residues

Construction bPST

We add histories for the training data

Nodes: parameters that estimate the conditional probabilities γhistory(a) = P(a|history)

PbPST(xi | xi−1, ..., x1) =
  γx1...xi−1(xi) if x1...xi−1 is in the bPST,
  else γx2...xi−1(xi) if x2...xi−1 is in the bPST,
  etc.,
  else γ(xi)
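The fallback rule amounts to a longest-suffix lookup. The sketch below stores the tree as a plain dictionary keyed by history in left-to-right order, which is an assumption made for brevity and not the data structure of the cited papers. The toy probabilities are unsmoothed counts from the binary string 010010010011110101100010111, and the choice of which histories survive pruning is hypothetical.

```python
from fractions import Fraction

# Hypothetical pruned tree: "" is the root, other keys are histories
# written oldest symbol first. Values are next-symbol distributions.
TREE = {
    "":   {"0": Fraction(13, 27), "1": Fraction(14, 27)},
    "0":  {"0": Fraction(5, 13),  "1": Fraction(8, 13)},
    "1":  {"0": Fraction(7, 13),  "1": Fraction(6, 13)},
    "01": {"0": Fraction(5, 8),   "1": Fraction(3, 8)},
    "00": {"0": Fraction(1, 5),   "1": Fraction(4, 5)},
}


def bpst_probability(tree, history, symbol):
    """Return gamma for the longest suffix of history that is in the tree."""
    for start in range(len(history) + 1):
        suffix = history[start:]  # drops the oldest symbols first
        if suffix in tree:
            return tree[suffix][symbol]
    raise KeyError("tree must contain the empty history as its root")


print(bpst_probability(TREE, "010", "0"))   # falls back to history "0"
print(bpst_probability(TREE, "0100", "1"))  # falls back to history "00"
```

Dropping the oldest symbol at each step mirrors the "else" branches of the formula, ending at the root estimate γ(xi).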

bPST created and pruned using 010010010011110101100010111

P(01001) = P(0) P(1|0) P(0|01) P(0|010) P(1|0100)

= γ(0) γ0(1) γ01(0) γ0*(0) γ00*(1)

= (13/27)(8/13)(5/8)(5/13)(4/5) = 10400/182520 ≈ 0.057
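The five factors can be re-derived from raw counts in the training string. This is a checking sketch only: the history used at each position is taken from the worked example, and estimating each γ as an unsmoothed ratio of overlapping substring counts is an assumption that happens to reproduce the figures shown.

```python
from fractions import Fraction
from math import prod

TRAIN = "010010010011110101100010111"
ALPHABET = "01"


def count(sub):
    """Number of (possibly overlapping) occurrences of sub in TRAIN."""
    return sum(TRAIN.startswith(sub, i) for i in range(len(TRAIN) - len(sub) + 1))


def gamma(history, symbol):
    """Unsmoothed estimate of P(symbol | history) from substring counts."""
    if history == "":
        return Fraction(count(symbol), len(TRAIN))
    continuations = sum(count(history + a) for a in ALPHABET)
    return Fraction(count(history + symbol), continuations)


# (history actually used, predicted symbol) for each position of 01001.
steps = [("", "0"), ("0", "1"), ("01", "0"), ("0", "0"), ("00", "1")]
factors = [gamma(h, s) for h, s in steps]
probability = prod(factors)
print(factors, probability, float(probability))
```

In lowest terms the product is 20/351, the same value as 10400/182520.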

Complexity bPST

The bPST building process requires O(Ln²) time

L is the length limit of the tree

n is the total length of the training set

bPST building requires all training sequences at once (in order to get all the reverse substrings) and cannot be done online (the bPST cannot be built as the training data is encountered)

Prediction: O(mL), m = sequence length

Improved bPST

Idea: tree with training sequences

n: length of all training sequences

m: length of the tested sequence

Result (theoretical): linear time building O(n), linear time prediction O(m)

Efficient PST (ePST)

Used for predicting protein function

ePST represents sequences

Linear construction and prediction

Example ePST

Prediction with ePSTs

The probabilities for a substring are obtained for each position by tracing the path representing the sequence from the root

If the entire sequence is not found in the tree, suffix links are followed

Construction ePST

ePSTs gain efficiency by representing the training sequences in the PST

Nodes store counts of the subsequence occurrences in the training data (with respect to the complete tree)

Conditional probabilities derived from the counts are stored as well

Example ePST - AYYYA
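The kind of counts such a tree would hold for AYYYA can be tabulated with a naive sketch. This is quadratic in the sequence length and is not the linear-time ePST construction. Defining the conditional probability as a ratio of continuation counts is an assumption, and both function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def substring_counts(sequences):
    """Count every substring occurrence (naive, quadratic per sequence)."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq)):
            for end in range(start + 1, len(seq) + 1):
                counts[seq[start:end]] += 1
    return counts


def conditional(counts, context, symbol, alphabet):
    """P(symbol | context) as count(context + symbol) over all continuations.

    Raises ZeroDivisionError if the context is never followed by any symbol.
    """
    total = sum(counts[context + a] for a in alphabet)
    return Fraction(counts[context + symbol], total)


counts = substring_counts(["AYYYA"])
print(counts["A"], counts["Y"], counts["YY"])
print(conditional(counts, "Y", "Y", "AY"))
```

A `Counter` returns 0 for substrings that never occur, so unseen continuations simply contribute nothing to the denominator.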

Complexity ePST

Linear time and space, O(n), with respect to the combined length n of the training sequences

Linear prediction time O(m)

Advantages and Disadvantages

Avoid the exponential space requirements and parameter estimation problems of higher order Markov chains

Pruned during training to contain only required parameters

bPSTs for local predictions: more accurate prediction than global

Loss in classification performance: Pfam, SCOP

Conclusions

PSTs require less training and prediction time than HMMs

Despite some loss in classification performance, PSTs compete with HMMs because of their reduced resource demands

PSTs take advantage of the higher-order correlations captured by VMMs

References

Brett Poulin, Sequence-based Protein Function Prediction, Master's thesis, University of Alberta, 2004

G. Bejerano, G. Yona, Modeling protein families using probabilistic suffix trees, RECOMB’99

G. Bejerano, Algorithms for variable length Markov chain modeling, Bioinformatics Applications Note, 20(5):788–789, 2004

PSTs and HMMs

“HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions.” [1]

PSTs are variable length Markov models for efficient prediction. The prediction uses the longest available context matching the history of the current amino acid.

For protein prediction in general, “the main advantage of PSTs over HMMs is that the training and prediction time requirements of PSTs are much less than for the equivalent HMMs.” [1]

Suffix Trees (ST)

bPST

Histories added to the tree must occur more frequently than a threshold Pmin

The substrings are added in order of length from smallest to largest

bPST vs ST

The string s is only added to the tree if two conditions hold. First, the resulting conditional probability at the node to be created must be greater than the minimum prediction probability γmin + α. Second, the probability for the prefix of the string must differ (by some ratio r) from the probability assigned to the next shortest substring suf(s), which is already in the tree.

After all the substrings are added to the tree, the probabilities are smoothed according to the parameter γmin.

The smoothing (as calculated by the equation below) prevents any probability from being less than γmin
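Both the addition test and the smoothing step can be sketched in a few lines. Everything here is an assumption layered on the description above: the test is applied per symbol, a ratio "differs" when it is at least r or at most 1/r, and the smoothing takes the form γs(a) = (1 − |Σ| γmin) P(a|s) + γmin, one rule that guarantees the stated lower bound. The function names are hypothetical.

```python
from fractions import Fraction


def should_add(p_s, p_suf, gamma_min, alpha, r):
    """True if some symbol passes both tests for the candidate string s.

    p_s and p_suf map each symbol to its estimated probability after s
    and after suf(s). Symbols with zero probability after suf(s) are
    skipped to avoid division by zero (an assumption of this sketch).
    """
    for symbol, p in p_s.items():
        if p > gamma_min + alpha and p_suf[symbol] > 0:
            ratio = p / p_suf[symbol]
            if ratio >= r or ratio <= 1 / r:
                return True
    return False


def smooth(dist, gamma_min):
    """Rescale dist so that every probability is at least gamma_min."""
    size = len(dist)
    if size * gamma_min >= 1:
        raise ValueError("gamma_min is too large for this alphabet")
    return {a: (1 - size * gamma_min) * p + gamma_min for a, p in dist.items()}


smoothed = smooth({"0": Fraction(1), "1": Fraction(0)}, Fraction(1, 20))
print(smoothed)
```

Because the original probabilities sum to 1, the smoothed values also sum to (1 − |Σ| γmin) + |Σ| γmin = 1, so the result is still a distribution.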
