statistical significance of alignment scores

46
Statistical significance of alignment scores Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington [email protected]

Upload: calix

Post on 22-Feb-2016

40 views

Category:

Documents


0 download

DESCRIPTION

Statistical significance of alignment scores. Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington [email protected]. Outline. Responses from last class Revision Hypothesis testing and the p-value - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Statistical significance of alignment scores

Statistical significance of alignment scores

Prof. William Stafford NobleDepartment of Genome Sciences

Department of Computer Science and EngineeringUniversity of Washington

[email protected]

Page 2: Statistical significance of alignment scores
Page 3: Statistical significance of alignment scores

Outline

• Responses from last class• Revision• Hypothesis testing and the p-value• Extreme value distribution• Python

Page 4: Statistical significance of alignment scores

One-minute responses

• Be patient with our programming skills.• Explain the Python code before we start solving

problems.• We should do an evening tutorial with the tutors

for Python dictionaries and sys.argv.• I am not used to the sys library and working from a

.txt file. Otherwise, Python programming is fine.• What are the things being aligned? Are they

proteins?

Page 5: Statistical significance of alignment scores

Are these proteins homologs?SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY | | | | | | | | |

SEQ 2: QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLF

SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY

| | | ||||||||| | | |

SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF

SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY

||| | || | ||||||||| | |||||||

SEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF

YES (score = 153)

MAYBE (score = 104)

NO (score = 69)

Page 6: Statistical significance of alignment scores

Significance of scores

Sequencealignment algorithm

HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT

LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE

45

Low score = unrelatedHigh score = homologs

How high is high enough?

Page 7: Statistical significance of alignment scores

An empirical histogram

• Roll a 20-sided die 1000 times.• Keep a tally of the number of times each value

is observed.

Page 8: Statistical significance of alignment scores

A table of results1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

41 52 67 45 34 55 53 71 60 50 41 43 54 56 50 46 48 35 61 38

Page 9: Statistical significance of alignment scores

A table of results1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

41 52 67 45 34 55 53 71 60 50 41 43 54 56 50 46 48 35 61 38

A histogram plots the number of times or the frequency with which each value of a given variable (e.g., the die value) is observed.

Page 10: Statistical significance of alignment scores

After 21,000 rolls

Se-ries1

0

200

400

600

800

1000

1200

Die value

Number of observations

As the number of observations increases, the histogram becomes smoother.

Page 11: Statistical significance of alignment scores

Frequency histogram = distribution

Se-ries1

0

0.1

0.20.3

0.40.5

0.6

0.7

0.8

0.91

Die value

Frequency of observations

• Divide through by the total number of observations to get frequencies.

• The resulting histogram is called a distribution.• The sum of the bars in the distribution is 1.• This particular distribution is “uniform.”

Page 12: Statistical significance of alignment scores

Distribution from two dice

2 3 4 5 6 7 8 9 10 11 120

0.05

0.1

0.15

0.2

Dice value

Frequency of observations

• There is only one way to get a score of 2.• How many ways can you get a 7 from two six-sided

dice?• This distribution is approximately normal (Gaussian).

Page 13: Statistical significance of alignment scores

Normal distribution

This is probably the most common distribution in science.The area under a continuous distribution curve is 1.

Page 14: Statistical significance of alignment scores

The null hypothesis• We are interested in characterizing the distribution

of scores from sequence comparison algorithms.• We would like to measure how surprising a given

score is, assuming that the two sequences are not related.

• The assumption is called the null hypothesis.• The purpose of most statistical tests is to determine

whether the observed results provide a reason to reject the hypothesis that the results are merely a product of chance factors.

Page 15: Statistical significance of alignment scores

Sequence similarity score distribution

• Search a randomly generated database of DNA sequences using a randomly generated DNA query.

• What will be the form of the resulting distribution of pairwise sequence comparison scores?

Sequence comparison score

Frequency ?

Page 16: Statistical significance of alignment scores

Empirical score distribution

• The picture shows a distribution of scores from a real database search using BLAST.

• This distribution contains scores from non-homologous and homologous pairs.

High scores from homology.

Score

Freq

uenc

y

Page 17: Statistical significance of alignment scores

Empirical null score distribution

• This distribution is similar to the previous one, but generated using a randomized sequence database.

Page 18: Statistical significance of alignment scores

Computing a p-value

• The probability of observing a score >X is the area under the curve to the right of X.

• This probability is called a p-value.

• p-value = Pr(data|null)

Out of 1685 scores, 28 receive a score of 20 or better. Thus, the p-value associated with a score of 20 is approximately 28/1685 = 0.0166.

Page 19: Statistical significance of alignment scores

Problems with empirical distributions

• We are interested in very small probabilities.• These are computed from the tail of the

distribution.• Estimating a distribution with accurate tails is

computationally very expensive.

Page 20: Statistical significance of alignment scores

A solution

• Solution: Characterize the form of the distribution mathematically.

• Fit the parameters of the distribution empirically, or compute them analytically.

• Use the resulting distribution to compute accurate p-values.

Page 21: Statistical significance of alignment scores

Extreme value distribution

This distribution is characterized by a larger tail on the right.

Page 22: Statistical significance of alignment scores

Computing a p-value

• The probability of observing a score >4 is the area under the curve to the right of 4.

• This probability is called a p-value.• p-value = Pr(data|null)

Page 23: Statistical significance of alignment scores

Extreme value distribution

Compute this value for x=4.

Page 24: Statistical significance of alignment scores

Computing a p-value

• Calculator keys: 4, +/-, inv, ln, +/-, inv, ln, +/-, +, 1, =

• Solution: 0.018149

Page 25: Statistical significance of alignment scores

Scaling the EVD

• An extreme value distribution derived from, e.g., the Smith-Waterman algorithm will have a characteristic mode μ and scale parameter λ.

• These parameters depend upon the size of the query, the size of the target database, the substitution matrix and the gap penalties.

Page 26: Statistical significance of alignment scores

An exampleYou run the Smith-Waterman algorithm and get a score of 45. You then run Smith-Waterman with the same query but using a shuffled version of the database, and you fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = 0.693. What is the p-value associated with 45?

Page 27: Statistical significance of alignment scores

An example

You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = 0.693. What is the p-value associated with 45?

Page 28: Statistical significance of alignment scores

What p-value is significant?• The most common thresholds are 0.01 and 0.05.• A threshold of 0.05 means you are 95% sure that the

result is significant.• Is 95% enough? It depends upon the cost associated

with making a mistake.• Examples of costs:

– Doing expensive wet lab validation.– Making clinical treatment decisions.– Misleading the scientific community.

• Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.

Page 29: Statistical significance of alignment scores

Multiple testing

• Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations.

• Assume that all of the observations are explainable by the null hypothesis.

• What is the chance that at least one of the observations will receive a p-value less than 0.05?

Page 30: Statistical significance of alignment scores

Multiple testing• Say that you perform a statistical test with a 0.05 threshold, but you repeat the

test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05?

• Pr(making a mistake) = 0.05• Pr(not making a mistake) = 0.95• Pr(not making any mistake) = 0.9520 = 0.358• Pr(making at least one mistake) = 1 - 0.358 = 0.642

• There is a 64.2% chance of making at least one mistake.

Page 31: Statistical significance of alignment scores

Bonferroni correction

• Assume that individual tests are independent. • Divide the desired p-value threshold by the number

of tests performed.• For the previous example, 0.05 / 20 = 0.0025.• Pr(making a mistake) = 0.0025• Pr(not making a mistake) = 0.9975• Pr(not making any mistake) = 0.997520 = 0.9512• Pr(making at least one mistake) = 1 - 0.9512 = 0.0488

Page 32: Statistical significance of alignment scores

Database searching

• Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?

Page 33: Statistical significance of alignment scores

Database searching• Say that you search the non-redundant protein

database at NCBI, containing roughly one million sequences. What p-value threshold should you use?

• Say that you want to use a conservative p-value of 0.001.

• Recall that you would observe such a p-value by chance approximately every 1000 times in a random database.

• A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10-9.

Page 34: Statistical significance of alignment scores

E-values• A p-value is the probability of making a mistake.• The E-value is the expected number of times that the

given score would appear in a random database of the given size.

• One simple way to compute the E-value is to multiply the p-value times the size of the database.

• Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.

BLAST actually calculates E-values in a more complex way.

Page 35: Statistical significance of alignment scores
Page 36: Statistical significance of alignment scores
Page 37: Statistical significance of alignment scores
Page 38: Statistical significance of alignment scores

Summary• A distribution plots the frequency of a given type of observation.• The area under the distribution is 1.• Most statistical tests compare observed data to the expected result

according to the null hypothesis.• Sequence similarity scores follow an extreme value distribution, which is

characterized by a larger tail.• The p-value associated with a score is the area under the curve to the

right of that score.• Selecting a significance threshold requires evaluating the cost of making a

mistake.• Bonferroni correction: Divide the desired p-value threshold by the number

of statistical tests performed.• The E-value is the expected number of times that the given score would

appear in a random database of the given size.

Page 40: Statistical significance of alignment scores

Why write programs clearly?

• Many people may want to read and understand your program– Collaborators and colleagues– Future lab members– People who read your published paper– Your supervisor– Your future self

• Well written code can more easily be re-used later.

Page 41: Statistical significance of alignment scores

White space

• The layout of the program on the page should reflect the logical structure of the program.

• Insert blank lines to indicate related lines of code.

Page 42: Statistical significance of alignment scores

Comments

• Above all, use comments.• Don’t use too many comments.• Don’t use too few comments.• Adopt a consistent commenting style.• Be sure to comment particularly tricky lines of

code.

Page 43: Statistical significance of alignment scores

Naming

• Naming is more important than you think.• Avoid single letter variable names.• Use abbreviations sparingly.• Think about your variables names• Don’t be afraid to change confusing names.

Page 44: Statistical significance of alignment scores

#!/usr/bin/pythonimport sys

USAGE = """USAGE: revcomp.py <string>

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string formed by reversing the symbols of s, then taking the complement of each symbol.

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement of s."""

# Dictionary storing the complementarity of DNA nucleotides.revComp = { "A":"T", "T":"A", "G":"C", "C":"G" }

# Parse the command line.if (len(sys.argv) != 2): sys.stderr.write(USAGE) sys.exit(1)dna = sys.argv[1]

# Do the reverse complement.for index in range(len(dna) - 1, -1, -1): char = dna[index] if char in revComp: sys.stdout.write(revComp[char]) else: sys.stderr.write("Cannot complement %s.\n" % char) sys.exit(1)sys.stdout.write("\n”)

Page 45: Statistical significance of alignment scores

#!/usr/bin/pythonimport sys

USAGE = """USAGE: revcomp.py <string>

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string formed by reversing the symbols of s, then taking the complement of each symbol.

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement of s."""

# Dictionary storing the complementarity of DNA nucleotides.revComp = { "A":"T", "T":"A", "G":"C", "C":"G" }

# Parse the command line.if (len(sys.argv) != 2): sys.stderr.write(USAGE) sys.exit(1)dna = sys.argv[1]

# Do the reverse complement.for index in range(len(dna) - 1, -1, -1): char = dna[index] if char in revComp: sys.stdout.write(revComp[char]) else: sys.stderr.write("Cannot complement %s.\n" % char) sys.exit(1)sys.stdout.write("\n”)

Page 46: Statistical significance of alignment scores

Problem #1

• Given:– a bin width– a file containing one

number per line, and• Return: the frequency

associated with each bin

• File: 1 1 2 3 4 5 7 9 10 10 10• Bin width: 1• Output:1 22 13 14 15 16 07 18 09 110 3