cse182-l4: scoring matrices, dictionary matching

Upload: kato-fernandez

Post on 03-Jan-2016

TRANSCRIPT

Page 1: CSE182-L4:  Scoring matrices, Dictionary Matching

Fa05 CSE 182

CSE182-L4: Scoring matrices, Dictionary Matching

Page 2: CSE182-L4:  Scoring matrices, Dictionary Matching

Class Mailing List

[email protected]

• To subscribe, send email to
  – [email protected]
• You can subscribe from the course web page
• Use the list for all course-related queries, discussions, …

Page 3: CSE182-L4:  Scoring matrices, Dictionary Matching

Silly Quiz

• Name a famous Bioinformatics Researcher

• Name a famous Bioinformatics Researcher who is a woman

Page 4: CSE182-L4:  Scoring matrices, Dictionary Matching

Scoring DNA

• DNA has structure.


Page 5: CSE182-L4:  Scoring matrices, Dictionary Matching

DNA scoring matrices

• So far, we considered a simple match/mismatch criterion.

• The nucleotides can be grouped into Purines (A,G) and Pyrimidines (C,T).

• Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)
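
A minimal sketch of a DNA scoring matrix that treats transitions more leniently than transversions. The function name `dna_score` and the numeric values (+1, -1, -2) are illustrative assumptions, not values from the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides (the default values are made up)."""
    if a == b:
        return match
    # A substitution within one group is a transition; across groups, a transversion.
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_group else transversion

# The full 4x4 matrix, keyed by (row nucleotide, column nucleotide).
DNA_MATRIX = {(a, b): dna_score(a, b) for a in "ACGT" for b in "ACGT"}
```

With these defaults, an A/G pair is penalized less than an A/C pair, which reflects the claim that transitions are the more likely substitution.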


Page 6: CSE182-L4:  Scoring matrices, Dictionary Matching

Scoring proteins

• Scoring protein sequence alignments is a much more complex task than scoring DNA
  – Not all substitutions are equal
• Problem was first worked on by Pauling and collaborators
• In the 1970s, Margaret Dayhoff created the first similarity matrices.
  – “One size does not fit all”
  – Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant
  – Different proteins might evolve at different rates and we need to normalize for that

Page 7: CSE182-L4:  Scoring matrices, Dictionary Matching

PAM 1 distance

• Two sequences are 1 PAM apart if they differ in 1% of the residues.

• PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart]

1% mismatch

Page 8: CSE182-L4:  Scoring matrices, Dictionary Matching

PAM1 matrix

• Align many proteins that are very similar
  – Is this a problem?

• PAM1 distance is the probability of a substitution when 1% of the residues have changed

• Estimate the frequency $P_{b|a}$ of residue a being substituted by residue b.

• $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b}$
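
The two forms of the score agree because the joint probability factors as P_ab = P_a · P_{b|a}. A small sketch of both forms follows; the function names and the numbers in the usage note are hypothetical.

```python
import math

def log_odds_joint(p_ab, p_a, p_b):
    """S(a,b) from the joint probability: log10(P_ab / (P_a * P_b))."""
    return math.log10(p_ab / (p_a * p_b))

def log_odds_conditional(p_b_given_a, p_b):
    """S(a,b) from the conditional probability: log10(P_b|a / P_b)."""
    return math.log10(p_b_given_a / p_b)
```

For example, with P_a = 0.1, P_b = 0.05 and P_{b|a} = 0.02 (so P_ab = 0.002), both functions return the same negative score: b follows a less often than chance would predict.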

Page 9: CSE182-L4:  Scoring matrices, Dictionary Matching

PAM 1

Page 10: CSE182-L4:  Scoring matrices, Dictionary Matching

PAM distance

• Two sequences are 1 PAM apart when they differ in 1% of the residues.

• When are 2 sequences 2 PAMs apart?

1 PAM

1 PAM

2 PAM

Page 11: CSE182-L4:  Scoring matrices, Dictionary Matching

Higher PAMs

• $\mathrm{PAM2}(a,b) = \sum_c \mathrm{PAM1}(a,c) \cdot \mathrm{PAM1}(c,b)$

• PAM2 = PAM1 * PAM1 (Matrix multiplication)

• PAM250
  – = PAM1 * PAM249
  – = $\mathrm{PAM1}^{250}$
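
A sketch of the matrix-power construction in plain Python. `TOY_PAM1` is a made-up 3-letter transition matrix (each row sums to 1 with a 1% chance of change), not Dayhoff's 20x20 data; `mat_mul` and `mat_pow` are hypothetical helper names.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Illustrative numbers only: a 3-letter alphabet, 1% total change per row.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.003, 0.990, 0.007],
    [0.005, 0.005, 0.990],
]

pam2 = mat_pow(TOY_PAM1, 2)      # PAM1 * PAM1
pam250 = mat_pow(TOY_PAM1, 250)  # PAM1 raised to the 250th power
```

Each row of the result is still a probability distribution, so the entries remain transition probabilities rather than scores.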

Page 12: CSE182-L4:  Scoring matrices, Dictionary Matching

Note: This is not the score matrix: What happens as you keep increasing the power?

Page 13: CSE182-L4:  Scoring matrices, Dictionary Matching

Scoring using PAM matrices

• Suppose we know that two sequences are 250 PAMs apart.

• $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_b}$
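
As a hedged illustration of this formula, the sketch below uses a made-up two-letter alphabet with a symmetric PAM1 matrix and uniform background frequencies. All numbers and helper names are assumptions chosen to keep the example small.

```python
import math

# Hypothetical two-letter alphabet; the numbers are illustrative only.
PAM1 = [[0.99, 0.01],
        [0.01, 0.99]]
BACKGROUND = [0.5, 0.5]  # P_b, assumed uniform for this toy example

def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    """Transition probabilities at distance n: PAM1 raised to the n-th power."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result

def score_matrix(pam, background):
    """S(a,b) = log10(PAM(a,b) / P_b) for every pair (a, b)."""
    size = len(pam)
    return [[math.log10(pam[a][b] / background[b]) for b in range(size)]
            for a in range(size)]

S = score_matrix(pam_n(PAM1, 250), BACKGROUND)
```

In this toy setting the diagonal scores come out positive and the off-diagonal scores negative, since even at 250 PAMs a residue is still slightly more likely to be conserved than the background predicts.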

Page 14: CSE182-L4:  Scoring matrices, Dictionary Matching

BLOSUM series of Matrices

• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions

• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.

• BLOSUM60: Merge all proteins that have greater than 60% identity. Then, compute the substitution probability.
  – In practice BLOSUM62 seems to work very well.
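
A simplified sketch of the "compute the substitution probability" step, assuming a single ungapped block of aligned sequences. The merging (clustering) step is omitted here, and the function name `pair_log_odds` and the toy block in the usage note are hypothetical. Scores are left unrounded in half-bit units (2 · log2 of observed over expected pair frequency), the scaling commonly used for BLOSUM matrices.

```python
import math
from collections import Counter
from itertools import combinations

def pair_log_odds(block):
    """block: list of equal-length aligned strings without gaps."""
    # Count every unordered residue pair within each alignment column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Marginal frequency of each residue among the counted pairs.
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] * p[x] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = 2 * math.log2(freq / expected)
    return scores
```

Calling `pair_log_odds(["AAC", "AAC", "ABC"])` yields scores for the pairs (A,A), (A,B) and (C,C) only; pairs that never co-occur in a column get no entry in this sketch.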

Page 15: CSE182-L4:  Scoring matrices, Dictionary Matching

PAM vs. BLOSUM

• What is the correspondence?

• PAM1 Blosum1
• PAM2 Blosum2
• Blosum62
• PAM250 Blosum100

Page 16: CSE182-L4:  Scoring matrices, Dictionary Matching

Dictionary Matching, R.E. matching, and position specific scoring

Page 17: CSE182-L4:  Scoring matrices, Dictionary Matching

Dictionary Matching

• Q: Given k words (s_i has length l_i), and a database of size n, find all matches to these words in the database string.

• How fast can this be done?

dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE

database: P O T A S T P O T A T O

Page 18: CSE182-L4:  Scoring matrices, Dictionary Matching

Dict. Matching & string matching

• How fast can you do it, if you only had one word of length m?
  – Trivial algorithm: O(nm) time
  – Pre-processing O(m), Search O(n) time.

• Dictionary matching
  – Trivial algorithm: O((l_1 + l_2 + l_3 + …) n)
  – Using a keyword tree: O(l_p n) (l_p is the length of the longest pattern)
  – Aho-Corasick: O(n) after preprocessing O(l_1 + l_2 + …)

• We will consider the most general case
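
For reference, a sketch of the trivial dictionary-matching algorithm, which tries every word at every database position. The function name and the (start, word index) hit format are assumptions made for this example.

```python
def naive_dictionary_match(words, db):
    """Return sorted (start position, word index) pairs; word indices start at 1."""
    hits = []
    for index, word in enumerate(words, start=1):
        # Compare the word against every window of the same length.
        for start in range(len(db) - len(word) + 1):
            if db[start:start + len(word)] == word:
                hits.append((start, index))
    return sorted(hits)

words = ["POTATO", "POTASSIUM", "TASTE"]
db = "POTASTPOTATO"
hits = naive_dictionary_match(words, db)
```

On this input the only hit is POTATO (word 1) starting at 0-based position 6.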

Page 19: CSE182-L4:  Scoring matrices, Dictionary Matching

Direct Algorithm

[Figure: the pattern P O T A T O is slid along the database P O T A S T P O T A T O, one position at a time.]

Observations:
• When we mismatch, we (should) know something about where the next match will be.
• When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

Page 20: CSE182-L4:  Scoring matrices, Dictionary Matching

The Trie Automaton

[Figure: trie (keyword tree) for the dictionary; its branches spell P O T A T O, (P O T A) S S I U M, and T A S T E.]

• Construct an automaton A from the dictionary
  – A[v,x] describes the transition from node v to a node w upon reading x.
  – A[u,’T’] = v, and A[u,’S’] = w
  – Special root node r
  – Some nodes are terminal, and labeled with the index of the dictionary word.
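
One possible way to build such an automaton, sketched with a list of per-node dictionaries so that `A[v][x]` plays the role of A[v,x]. The function name `build_trie` and the choice of node 0 as the root r are assumptions of this example.

```python
def build_trie(words):
    """Return (A, terminal): A[v] maps a character to a child node; node 0 is the root."""
    A = [{}]
    terminal = {}  # terminal node -> index of the dictionary word (starting at 1)
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})          # create a new node for this transition
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = index
    return A, terminal

A, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])
```

POTATO and POTASSIUM share the path P-O-T-A, so the trie has 17 nodes including the root rather than one node per character of the dictionary.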

dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE

[Figure labels: root r; nodes u, v, w; terminal nodes 1, 2, 3.]

Page 21: CSE182-L4:  Scoring matrices, Dictionary Matching

An O(l_p n) algorithm for keyword matching

• Start with the first position in the db, and the root node.
• If successful transition
  – Increment ‘current’ pointer
  – Move to a new node
  – If terminal node, “success”
• Else
  – Retract ‘current’ pointer
  – Increment ‘start’ pointer
  – Move to root & repeat
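
The steps above can be sketched as follows. Restarting `current` at each new `start` corresponds to retracting the pointer and returning to the root; the function names and the (start, word index) hit format are assumptions.

```python
def build_trie(words):
    """A[v] maps a character to a child node; node 0 is the root."""
    A, terminal = [{}], {}
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = index
    return A, terminal

def keyword_tree_search(words, db):
    """Report (start, word index) for every dictionary word found in db."""
    A, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):      # 'start' pointer l
        v = 0                         # move to root
        current = start               # 'current' pointer c, retracted to start
        while current < len(db) and db[current] in A[v]:
            v = A[v][db[current]]     # successful transition
            current += 1
            if v in terminal:
                hits.append((start, terminal[v]))
    return hits

hits = keyword_tree_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

Each start position walks at most l_p steps down the trie, which gives the O(l_p n) bound.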

Page 22: CSE182-L4:  Scoring matrices, Dictionary Matching

Illustration:

[Figure: the trie scanned against the database P O T A S T P O T A T O, showing the start pointer l, the current pointer c, and the current node v.]

Page 23: CSE182-L4:  Scoring matrices, Dictionary Matching

Idea for improving the time

database: P O T A S T P O T A T O (with pointers l and c)

• Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match
  – Then prefix(pattern j) = suffix(first c-l characters of pattern i)

dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE

Pattern i: P O T A S S I U M
Pattern j: T A S T E

Page 24: CSE182-L4:  Scoring matrices, Dictionary Matching

Improving speed of dictionary matching

• Every node v corresponds to a string s_v that is a prefix of some pattern.

• Define F[v] to be the node u such that s_u is the longest proper suffix of s_v (the root if no such node exists)

• If we fail to match at v, we should jump to F[v], and commence matching from there

• Let lp[v] = |s_u|
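
A sketch of one standard way to compute F and lp: process the trie breadth-first, so that the failure node of a parent is known before its children are handled. The helper names (`build_trie`, `failure_links`, `node_of`) are hypothetical, and node 0 stands for the root.

```python
from collections import deque

def build_trie(words):
    """A[v] maps a character to a child node; node 0 is the root."""
    A = [{}]
    for word in words:
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
    return A

def failure_links(A):
    """Return (F, lp): F[v] is the failure node of v, lp[v] the length of its string."""
    F = [0] * len(A)
    depth = [0] * len(A)
    queue = deque()
    for w in A[0].values():           # depth-1 nodes fail to the root
        depth[w] = 1
        queue.append(w)
    while queue:                      # breadth-first: shallower nodes are done first
        v = queue.popleft()
        for x, w in A[v].items():
            depth[w] = depth[v] + 1
            u = F[v]
            while u and x not in A[u]:
                u = F[u]              # try shorter and shorter suffixes
            F[w] = A[u].get(x, 0)
            queue.append(w)
    lp = [depth[F[v]] for v in range(len(A))]
    return F, lp

def node_of(A, s):
    """Follow the string s from the root and return the node reached."""
    v = 0
    for x in s:
        v = A[v][x]
    return v

A = build_trie(["POTATO", "POTASSIUM", "TASTE"])
F, lp = failure_links(A)
```

For this dictionary, the failure node of POTAS is the node for TAS (so lp is 3 there), which is what lets a failed POTASSIUM match continue as a TASTE match.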

[Figure: the trie for POTATO, POTASSIUM, and TASTE with its nodes numbered 1 to 11.]

Page 25: CSE182-L4:  Scoring matrices, Dictionary Matching

An O(n) algorithm for keyword matching

• Start with the first position in the db, and the root node.
• If successful transition
  – Increment ‘current’ pointer
  – Move to a new node
  – If terminal node, “success”
• Else (if at root)
  – Increment ‘current’ pointer
  – Move ‘start’ pointer
  – Move to root
• Else
  – Move ‘start’ pointer forward
  – Move to failure node
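
A compact sketch of this search in the style of Aho-Corasick. The `out` lists are an addition not spelled out in the steps above: they let a node also report dictionary words that end at the same position as suffixes of the current match. Function names and the (start, word index) hit format are assumptions.

```python
from collections import deque

def build_automaton(words):
    """Return (A, F, out): transitions, failure nodes, and words ending at each node."""
    A, F, out = [{}], [0], [[]]
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                F.append(0)
                out.append([])
                A[v][x] = len(A) - 1
            v = A[v][x]
        out[v].append((index, len(word)))
    queue = deque(A[0].values())      # depth-1 nodes keep F = root
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u and x not in A[u]:
                u = F[u]
            F[w] = A[u].get(x, 0)
            out[w] = out[w] + out[F[w]]   # inherit words that end at the failure node
            queue.append(w)
    return A, F, out

def aho_corasick(words, db):
    """Report (start, word index) for every occurrence of a dictionary word in db."""
    A, F, out = build_automaton(words)
    hits = []
    v = 0
    for c, x in enumerate(db):        # 'current' pointer only moves forward
        while v and x not in A[v]:
            v = F[v]                  # failed transition: jump to the failure node
        v = A[v].get(x, 0)            # at the root with no transition we stay at the root
        for index, length in out[v]:
            hits.append((c - length + 1, index))
    return hits

hits = aho_corasick(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

The 'start' pointer is implicit here: it equals the current position minus the depth of the current node, so it never needs to be stored.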

Page 26: CSE182-L4:  Scoring matrices, Dictionary Matching

Illustration

[Figure: the trie with failure links scanned against the database P O T A S T P O T A T O, showing the pointers l and c and the current node v.]

Page 27: CSE182-L4:  Scoring matrices, Dictionary Matching

Time analysis

• In each step, either c is incremented, or l is incremented

• Neither pointer is ever decremented (lp[v] < c-l).

• l and c do not exceed n
• Total time <= 2n

database: P O T A S T P O T A T O (with pointers l and c)

Page 28: CSE182-L4:  Scoring matrices, Dictionary Matching

Blast: Putting it all together

• Input: Query of length m, database of size n

• Select word-size, scoring matrix, gap penalties, E-value cutoff

Page 29: CSE182-L4:  Scoring matrices, Dictionary Matching

Blast Steps

1. Generate an automaton of all query keywords.
2. Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.
3. Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute the bit-score, E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
5. Output results.
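
A toy seed-and-extend sketch of steps 1 to 3 only, under strong simplifying assumptions: exact word hits found with a hash index instead of an automaton, ungapped extension, a plain match/mismatch score instead of a scoring matrix, and no statistics (step 4). All names and parameter values are hypothetical; this is not how the actual BLAST program is implemented.

```python
def extend_one_way(q, d, qi, dj, step, match, mismatch, x_drop):
    """Extend from (qi, dj) in direction step; return (best gain, length at best)."""
    score = best = 0
    length = best_len = 0
    # Stop at either sequence end, or once the score falls more than x_drop below the best.
    while 0 <= qi < len(q) and 0 <= dj < len(d) and best - score <= x_drop:
        score += match if q[qi] == d[dj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        qi += step
        dj += step
    return best, best_len

def toy_blast(query, db, w=3, match=1, mismatch=-1, x_drop=2):
    """Return sorted (score, query start, db start, length) ungapped alignments."""
    # Step 1: index every word of length w in the query.
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    found = set()
    # Step 2: scan the database for word hits.
    for j in range(len(db) - w + 1):
        for i in index.get(db[j:j + w], []):
            # Step 3: extend the hit to the right and to the left.
            right, r_len = extend_one_way(query, db, i + w, j + w, 1,
                                          match, mismatch, x_drop)
            left, l_len = extend_one_way(query, db, i - 1, j - 1, -1,
                                         match, mismatch, x_drop)
            found.add((w * match + left + right, i - l_len, j - l_len,
                       w + l_len + r_len))
    return sorted(found, reverse=True)

alignments = toy_blast("POTATO", "POTASTPOTATO")
```

Several word hits extend to the same alignment, so collecting results in a set removes the duplicates before sorting by score.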

Page 30: CSE182-L4:  Scoring matrices, Dictionary Matching

Protein Sequence Analysis

• What can you do if BLAST does not return a hit?
  – Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.

• A: Accept hits at higher P-value.
  – This increases the probability that the sequence similarity is a chance event.
  – How can we get around this paradox?
  – Reformulated Q: suppose two sequences B, C have the same level of sequence similarity to sequence A. If A & B are related in function, can we assume that A & C are? If not, how can we distinguish?