
Position-specific scoring matrices: decrease complexity through info analysis

Training set including sequences from two Nostocs

71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA

Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA

71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT

Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT

71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC

71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG

Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG

71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT

Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT

71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA

Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA

We might increase the performance of our PSSM if we can filter out columns that don’t have “enough information”.

Not every column is equally well conserved – some seem to be more informative than others about what a binding site looks like!

Position-specific scoring matrices: decrease complexity through info analysis

Uncertainty: $H_c = -\sum_{i=1}^{M} p_{ic} \log_2(p_{ic})$

[Figure: “Uncertainty Distribution” – plot of uncertainty (H) versus fraction]

Confusing!!!
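Concretely, this per-column uncertainty can be computed with a few lines of code; here is a rough sketch in plain Python (the column_uncertainty helper and the mini-alignment are illustrative stand-ins, not the training set above):

from collections import Counter
from math import log2

def column_uncertainty(column):
    # Shannon uncertainty H_c = -sum(p_ic * log2(p_ic)) for one alignment column
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())

# Hypothetical mini-alignment of equal-length sequences, for illustration only.
alignment = ["CTGTAACA",
             "CTGTAACT",
             "TTGTAGCA",
             "CAGTAACA"]

for i, column in enumerate(zip(*alignment)):
    print(f"column {i}: {''.join(column)}  H = {column_uncertainty(column):.2f} bits")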

Digression on information theory: uncertainty when all outcomes are equally probable

Pretend we have a machine that spits out an infinitely long string of nucleotides, but that each one is EQUALLY LIKELY to occur:

G A T G A C T C …

How uncertain are we about the outcome BEFORE we see each new character produced by the machine?

Intuitively, this uncertainty will depend on how many possibilities exist.

Digression on information theory: quantifying uncertainty when outcomes are equally probable

If the possibilities are: A or G or C or T

One way to quantify uncertainty is to ask: “What is the minimum number of questions required to remove all ambiguity about the outcome?”

How many yes/no questions do we need to ask?

Digression on information theory: quantifying uncertainty when outcomes are equally probable

[Decision tree: AGCT → AG | CT → A | G | C | T]

M = 4 (alphabet size)

$H = \log_2(M)$

The number of decisions depends on the height of the decision tree.

With M = 4 we are uncertain by $\log_2(4) = 2$ bits before each new symbol is produced by our machine.
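As a minimal sketch in plain Python (the questions_needed helper is illustrative), the number of yes/no halvings needed to pin down one of M equally likely symbols matches log2(M):

from math import log2

def questions_needed(M):
    # Count how many times a set of M equally likely symbols can be halved
    # until a single possibility remains (a balanced yes/no decision tree).
    questions = 0
    while M > 1:
        M //= 2
        questions += 1
    return questions

for M in (2, 4, 8, 16):
    print(f"M = {M:2d}: {questions_needed(M)} questions, log2(M) = {log2(M):.0f} bits")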

Digression on information theory: uncertainty when all outcomes are equally probable

After we have received a new symbol from our machine, we are less uncertain.

Intuitively, when we become less uncertain, it means we have gained information.

Information = uncertainty before - uncertainty after, i.e. $\text{Information} = H_{\text{before}} - H_{\text{after}}$

Note that only in the special case where no uncertainty remains afterwards ($H_{\text{after}} = 0$) does information = $H_{\text{before}}$.

In the real world this never happens because of noise in the system!

Digression on information theory

Fine, but where did we get $H = -\sum_{i=1}^{M} P_i \log_2(P_i)$? It is necessary when outcomes are not equally probable!

Digression on information theory: uncertainty with unequal probabilities

Now our machine produces a string of symbols, but some are more likely to occur than others:

$P_A = 0.6$, $P_G = 0.1$, $P_C = 0.1$, $P_T = 0.2$

Digression on information theory: uncertainty with unequal probabilities

Now our machine produces a string of symbols, but we know that some are more likely to occur than others:

A A A T A A G T C …

Now how uncertain are we about the outcome BEFORE we see each new character?

Are we more or less surprised when we see an “A” or a “C”?

Digression on information theory: uncertainty with unequal probabilities

Now our machine produces a string of symbols, but we know that some are more likely to occur than others:

Do you agree that we are less surprised to see an “A” than we are to see a “G”?


Do you think that the output of our new machine is more or less uncertain?

Digression on information theory: what about when outcomes are not equally probable?

$\log_2 M = -\log_2 M^{-1} = -\log_2(1/M) = -\log_2(P)$

where $P = 1/M$ = the probability of a symbol appearing

Digression on information theory: what about when outcomes are not equally probable?

$P_A = 0.6$, $P_G = 0.1$, $P_C = 0.1$, $P_T = 0.2$  ($M = 4$)

$\sum_{i=1}^{M} P_i = 1$

Remember that the probabilities of all possible symbols must sum to 1!

Digression on information theory: how surprised are we to see a given symbol?

$u_i = -\log_2(P_i)$  (where $P_i$ = the probability of the $i$th symbol)

$u_A = -\log_2(0.6) = 0.7$
$u_G = -\log_2(0.1) = 3.3$
$u_C = -\log_2(0.1) = 3.3$
$u_T = -\log_2(0.2) = 2.3$

$u_i$ is therefore called the surprisal for symbol $i$.
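These values are easy to check; a quick sketch in plain Python using the probabilities given above:

from math import log2

probs = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}

for symbol, p in probs.items():
    surprisal = -log2(p)  # u_i = -log2(P_i)
    print(f"u_{symbol} = -log2({p}) = {surprisal:.1f} bits")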

Digression on information theory

What does the surprisal for a symbol have to do with uncertainty?

$u_i = -\log_2(P_i)$  (the “surprisal”)

Uncertainty is the average surprisal for the infinite string of symbols produced by our machine.

Digression on information theory

Let’s first imagine that our machine only produces a finite string of $N$ symbols.

$N = \sum_{i=1}^{M} N_i$

where $N_i$ is the number of times the $i$th symbol occurred in a string of length $N$.

For example, for the string “AAGTAACGA”:

$N_A = 5$, $N_G = 2$, $N_C = 1$, $N_T = 1$

$N = 9$
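A small sketch in plain Python reproduces these counts for the example string:

from collections import Counter

s = "AAGTAACGA"
counts = Counter(s)   # N_i for each symbol: A: 5, G: 2, T: 1, C: 1
N = len(s)            # N = 9

print(counts, "N =", N)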

Digression on information theory

For every Ni, there is a corresponding surprisal ui

therefore the average surprisal for N symbols will be:

$\frac{\sum_{i=1}^{M} N_i u_i}{\sum_{i=1}^{M} N_i} = \frac{\sum_{i=1}^{M} N_i u_i}{N} = \sum_{i=1}^{M} \frac{N_i}{N} u_i$

Digression on information theory

For every $N_i$ there is a corresponding surprisal $u_i$; therefore the average surprisal for $N$ symbols will be:

$\sum_{i=1}^{M} \frac{N_i}{N} u_i = \sum_{i=1}^{M} P_i u_i$

Remember that $P_i$ is simply the probability of generating the $i$th symbol!

But wait! We also already defined $u_i$!

Digression on information theory

$H = \sum_{i=1}^{M} P_i u_i$

Since $u_i = -\log_2(P_i)$, therefore:

$H = -\sum_{i=1}^{M} P_i \log_2(P_i)$

Congratulations! This is Claude Shannon’s famous formula defining uncertainty when the probability of each symbol is unequal!
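The two views agree numerically: a sketch in plain Python (using the example probabilities from above) compares Shannon’s formula with the average surprisal observed over a long string sampled from the machine.

import random
from math import log2

probs = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}

# Shannon uncertainty: H = -sum(P_i * log2(P_i))
H = -sum(p * log2(p) for p in probs.values())

# Empirical average surprisal over a long sampled string
random.seed(0)
sample = random.choices(list(probs), weights=list(probs.values()), k=100_000)
average_surprisal = sum(-log2(probs[s]) for s in sample) / len(sample)

print(f"H                 = {H:.3f} bits")                  # about 1.571 bits
print(f"average surprisal = {average_surprisal:.3f} bits")  # very close to H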

Digression on information theory

Uncertainty is largest when all symbols are equally probable! How does $H$ reduce assuming equiprobable symbols?

$H_{\text{eq}} = -\sum_{i=1}^{M} \frac{1}{M} \log_2\frac{1}{M} = -\left(\frac{1}{M} \log_2\frac{1}{M}\right)\sum_{i=1}^{M} 1 = -M\left(\frac{1}{M} \log_2\frac{1}{M}\right) = -\log_2\frac{1}{M} = \log_2 M$
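A quick numeric check (plain Python) that the equiprobable case gives exactly log2(M), and that a skewed distribution over the same alphabet is less uncertain:

from math import log2

def uncertainty(probs):
    # H = -sum(P_i * log2(P_i))
    return -sum(p * log2(p) for p in probs)

for M in (2, 4, 8):
    print(f"M = {M}: H_eq = {uncertainty([1 / M] * M):.2f} bits, log2(M) = {log2(M):.2f} bits")

print(f"skewed, M = 4: H = {uncertainty([0.6, 0.1, 0.1, 0.2]):.2f} bits")  # less than 2 bits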

Digression on information theory: uncertainty when $M = 2$

$H = -\sum_{i=1}^{M} P_i \log_2(P_i)$

Uncertainty is largest when all symbols are equally probable!
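For M = 2 this is easy to see directly; a sketch in plain Python of $H(p) = -p\log_2(p) - (1-p)\log_2(1-p)$ over a few values of p, peaking at 1 bit when the two symbols are equally probable (p = 0.5):

from math import log2

def binary_uncertainty(p):
    # Uncertainty for a two-symbol alphabet with probabilities p and 1 - p
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_uncertainty(p):.2f} bits")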

Digression on information theory

OK, but how much information is present in each column?

Information: $R = H_{\text{before}} - H_{\text{after}} = \log_2 M - \left(-\sum_{i=1}^{M} P_i \log_2(P_i)\right)$

Now “before” and “after” refer to before and after we examined the contents of a column.
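Putting this to work on PSSM columns: a sketch in plain Python of per-column information content, $R_c = \log_2(4) - H_c$, which could then be used to filter out low-information columns as suggested earlier. The mini-alignment and the 1-bit cutoff are illustrative assumptions, not values from the original slides.

from collections import Counter
from math import log2

def column_information(column, M=4):
    # R_c = H_before - H_after = log2(M) + sum(p_ic * log2(p_ic))
    n = len(column)
    h_after = -sum((c / n) * log2(c / n) for c in Counter(column).values())
    return log2(M) - h_after

# Hypothetical mini-alignment of binding sites, for illustration only.
alignment = ["CTGTAACA",
             "CTGTAACT",
             "TTGTAGCA",
             "CAGTAACA"]

for i, column in enumerate(zip(*alignment)):
    r = column_information(column)
    decision = "keep" if r >= 1.0 else "drop"  # illustrative 1-bit cutoff
    print(f"column {i}: {''.join(column)}  R = {r:.2f} bits  -> {decision}")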

Digression on information theory

http://weblogo.berkeley.edu/

Sequence logos graphically display how much information is present in each column.
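For drawing the logo itself, one option (an assumption on my part, not something the slides use) is the third-party logomaker package, whose alignment_to_matrix and Logo interfaces take an alignment and plot per-column information in bits; weblogo.berkeley.edu does the same through a web form.

import logomaker                 # assumed third-party package: pip install logomaker
import matplotlib.pyplot as plt

# Hypothetical mini-alignment, for illustration only.
alignment = ["CTGTAACA",
             "CTGTAACT",
             "TTGTAGCA",
             "CAGTAACA"]

# Convert the alignment to a per-column information matrix (bits) and draw the logo.
info_df = logomaker.alignment_to_matrix(alignment, to_type="information")
logomaker.Logo(info_df)
plt.show()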
