outline scoring matrices probability of matching runs quality of a database match

30

Upload: lucinda-carr

Post on 17-Dec-2015

232 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match
Page 2: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

OUTLINE

Scoring Matrices Probability of matching runs Quality of a database match

Page 3: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Scoring Matrices

Two alternative models for differences in DNA / protein sequences– Random:

All sequences are random selections of given pool of residues.

– Nonrandom: Sequences are related, Evolutionary process.

Page 4: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Scoring Matrices

– Random: pa: fraction of amino acid a in the pool (probbility of

occurance of the amino acid).

– Nonrandom: qa,b: the probability of finding particular residus a

and b aligned

Page 5: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Scoring Matrices

These two models can be compared:– qa,b / pa pb odds ration

– If qa,b > pa pb nonrondom model is more likely to produce the alignement of these residues.

Page 6: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Scoring Matrices

However, we need a single model– Assume: each position in an alignment will be

regarded as independent.– Odds ratio of alignment:

Page 7: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Scoring Matrices

log – odds ratio:

Negative value: probability of the two residues aligned is greater in the random model than nonrandom model.

Page 8: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Scoring Matrices

EXAMPLE:

If M occurs in the sequences with 0.01 frequency and L occurs with 0.1 frequency. By random pairing, you expect 0.001 amino acid pairs to be M-L. If the observed frequency of M-L is actually 0.003, score of matching M-L will be

log2(3)=1.585

Page 9: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Statistical significance measures:– p-value: the probability that at least one

sequence will produce the same score by chance

– E-value: expected number of sequences that will produce same or better score by chance

Page 10: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Analysis of coin tosses :

– “H” indicates a head– p probability of head (p = 0,5)– Probability of 5 heads in a run: 0,55=0,031– The expected number of times that 5H occurs

in above 14 coin tosses: 10x0,031 = 0,31

Page 11: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Analysis of coin tosses :– The expected number of a length l run of

heads in n tosses:

– Expected length R of the longest match in n tosses:

(Erdös-Rényi)

Page 12: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Analysis of coin tosses :– Example:

N = 20 R = log2(20) = 4,3

(in 20 coin tosses we expect 4,3 runs of heads, once )

Page 13: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

DNA / protein sequences:

Probability of an individual match

p = 1 / 20 = 0,05

Page 14: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Expected number of matches: 8x6x0.05 = 2,4

Page 15: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Expected number of two successive matches:

8x6x0,05x0,05 = 0,12

Page 16: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Expected number of length l matches:

Expected longest match of two sequences of length m and n:

where p is the probability of occurance of a single residue.

Page 17: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Expected number of length l matches:

Expected longest match of two sequences of length m and n:

where p is the probability of occurance of a single residue.

Page 18: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Example:– DNA seq: m = 32, n = 32

R = log4(32x32) = 5

– Amino acid seq: m = 100, n = 80

R = log20(100x80) = 3

Page 19: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

Under even the simplest random models and scoring systems, very little is known about the random distribution of optimal global alignment scores

Statistics for the scores of local alignments, unlike those of global alignments, are well understood.

Page 20: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Probability of matching runs

The optimal ungapped local alignment score follows the Gumble Extreme value distribution.

Because we always choose the best-scoring alignments the distribution will be Gumble Extreme value distribution.

Probability of obtaining an alignment of score S greater than a value x:

Page 21: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

How good is an alignment ? How believable the results of a database

search ?

Page 22: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

The alignment reports are selected according to the alignment score.

We need to know:– Whether the score is greater than we would

expect from the alignment of the sequences with a random sequence.

Page 23: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

Statistical significance measures:– p-value: the probability that at least one

sequence will produce the same score by chance

– E-value: expected number of sequences that will produce same or better score by chance

Page 24: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

Score Significance of the score.– By applying the Gumble Extreme value

distribution, it is possible to estimate the probability of two random sequences aligned with a score greater than or equal to the alignment score.

– E- value, p – value.

Page 25: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

E-value depends on:– The sequence length,– The number of sequences in the database,– Alignement score.

Page 26: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

A good E-value:– The smaller the E-value the better the

alignment,– The threshold value generally is set to 0,01 or

0,001.

Page 27: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

Page 28: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

Page 29: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

Quality of a database match

Page 30: OUTLINE Scoring Matrices Probability of matching runs Quality of a database match

References

M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008, Garland Science

Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A practical guide to the analysis of genes and proteins”, 2001, Wiley.