proteins, pair hmms, and alignment. cs262 lecture 8, win06, batzoglou a state model for alignment...

44
Proteins, Pair HMMs, and Alignment

Post on 21-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

Proteins, Pair HMMs, and Alignment

Page 2: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

A state model for alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACCIMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

M(+1,+1)

I(+1, 0)

J(0, +1)

Alignments correspond 1-to-1 with sequences of states M, I, J

Page 3: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Let’s score the transitions

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACCIMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

M(+1,+1)

I(+1, 0)

J(0, +1)

Alignments correspond 1-to-1 with sequences of states M, I, J

s(xi, yj)

s(xi, yj) s(xi, yj)

-d -d

-e -e

Page 4: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Alignment with affine gaps – state version

Dynamic Programming:

M(i, j): Optimal alignment of x1…xi to y1…yj ending in M

I(i, j): Optimal alignment of x1…xi to y1…yj ending in I

J(i, j): Optimal alignment of x1…xi to y1…yj ending in J

The score is additive, therefore we can apply DP recurrence formulas

Page 5: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Alignment with affine gaps – state version

Initialization:M(0,0) = 0; M(i, 0) = M(0, j) = -, for i, j > 0I(i,0) = d + ie; J(0, j) = d + je

Iteration:

M(i – 1, j – 1)M(i, j) = s(xi, yj) + max I(i – 1, j – 1)

J(i – 1, j – 1)

e + I(i – 1, j)I(i, j) = max

d + M(i – 1, j)

e + J(i, j – 1)J(i, j) = max

d + M(i, j – 1)

Termination:Optimal alignment given by max { M(m, n), I(m, n), J(m, n) }

Page 6: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Brief introduction to the evolution of proteins

• Protein sequence and structure• Protein classification• Phylogeny trees• Substitution matrices

Page 7: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Muscle cells and contraction

Page 8: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Actin and myosin during muscle movement

Page 9: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Actin structure

Page 10: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Actin sequence

• Actin is ancient and abundant Most abundant protein in cells 1-2 actin genes in bacteria, yeasts, amoebas Humans: 6 actin genes

-actin in muscles; -actin, -actin in non-muscle cells• ~4 amino acids different between each version

MUSCLE ACTIN Amino Acid Sequence

1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI 121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG 181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL 241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG 301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY 361 DESGPSIVHR KCF

Page 11: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

A related protein in bacteria

Page 12: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Relation between sequence and structure

Page 13: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Protein Phylogenies

• Proteins evolve by both duplication and species divergence

Page 14: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Protein Phylogenies – Example

Page 15: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Structure Determines Function

What determines structure?

• Energy• Kinematics

How can we determine structure?

• Experimental methods• Computational predictions

The Protein Folding Problem

Page 16: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Primary Structure: Sequence

• The primary structure of a protein is the amino acid sequence

Page 17: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Primary Structure: Sequence

• Twenty different amino acids have distinct shapes and properties

Page 18: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Primary Structure: Sequence

A useful mnemonic for the hydrophobic amino acids is "FAMILY VW"

Page 19: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Secondary Structure: , , & loops

helices and sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms

Page 20: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Tertiary Structure: A Protein Fold

Page 21: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

PDB GrowthN

ew

PD

B s

tru

ctu

res

Page 22: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Only a few folds are found in nature

Page 23: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Protein classification

• Number of protein sequences grows exponentially• Number of solved structures grows exponentially• Number of new folds identified very small (and close to constant)• Protein classification can

Generate overview of structure types Detect similarities (evolutionary relationships) between protein sequences Help predict 3D structure of new protein sequences

Morten Nielsen,CBS, BioCentrum, DTU

SCOP release 1.69, Class # folds # superfamilies # families

All alpha proteins 218 376 608

All beta proteins 144 290 560

Alpha and beta proteins (a/b) 136 222 629

Alpha and beta proteins (a+b) 279 409 717

Multi-domain proteins 46 46 61

Membrane & cell surface 47 88 99

Small proteins 75 108 171

Total 945 1539 2845

Classification of 25,973 protein structures in PDB

Page 24: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Protein world

Protein fold

Protein structure classification

Protein superfamily

Protein family

Morten Nielsen,CBS, BioCentrum, DTU

Page 25: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Structure Classification Databases

• SCOP Manual classification (A. Murzin) scop.berkeley.edu

• CATH Semi manual classification (C. Orengo) www.biochem.ucl.ac.uk/bsm/cath

• FSSP Automatic classification (L. Holm) www.ebi.ac.uk/dali/fssp/fssp.html

Morten Nielsen,CBS, BioCentrum, DTU

Page 26: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Major classes in SCOP

• Classes All proteins All proteins and proteins (/) and proteins (+) Multi-domain proteins Membrane and cell surface proteins Small proteins Coiled coil proteins

Morten Nielsen,CBS, BioCentrum, DTU

Page 27: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

All : Hemoglobin (1bab)

Morten Nielsen,CBS, BioCentrum, DTU

Page 28: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

All : Immunoglobulin (8fab)

Morten Nielsen,CBS, BioCentrum, DTU

Page 29: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Triosephosphate isomerase (1hti)

Morten Nielsen,CBS, BioCentrum, DTU

Page 30: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

: Lysozyme (1jsf)

Morten Nielsen,CBS, BioCentrum, DTU

Page 31: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Families

• Proteins whose evolutionarily relationship is readily recognizable from the sequence (>~25% sequence identity)

• Families are further subdivided into Proteins

• Families are divided into Species The same protein may be found in several

species

Fold

Family

Superfamily

Proteins

Morten Nielsen,CBS, BioCentrum, DTU

Page 32: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Superfamilies

• Proteins which are (remotely) evolutionarily related

Sequence similarity low

Share function

Share special structural features

• Relationships between members of a superfamily may not be readily recognizable from the sequence alone

Fold

Family

Superfamily

Proteins

Morten Nielsen,CBS, BioCentrum, DTU

Page 33: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Folds

• >~50% secondary structure elements arranged in the same order in sequence and in 3D

• No evolutionary relation

Fold

Family

Superfamily

Proteins

Morten Nielsen,CBS, BioCentrum, DTU

Page 34: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Substitutions of Amino Acids

Mutation rates between amino acids have dramatic differences!

Page 35: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Substitution Matrices

BLOSUM matrices:

1. Start from BLOCKS database (curated, gap-free alignments)2. Cluster sequences according to > X% identity

3. Calculate Aab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes

4. Estimate

P(a) = (b Aab)/(c≤d Acd); P(a, b) = Aab/(c≤d Acd)

Page 36: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Probabilistic interpretation of an alignment

An alignment is a hypothesis that the two sequences are related by evolution

Goal:

Produce the most likely alignment

Assert the likelihood that the sequences are indeed related

Page 37: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

A Pair HMM for alignments

MP(xi, yj)

IP(xi)

JP(yj)

1 – 2

1 –

1 –

This model generates two sequences simultaneously

Match/Mismatch state M:P(x, y) reflects substitution frequencies between pairs of amino acids

Insertion states I, J:P(x), P(y) reflect frequencies of each amino acid

: set so that 1/2 is avg. length before next match

: set so that 1/(1 – ) is avg. length of a gap

Model MM

optional

Page 38: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

A Pair HMM for unaligned sequences

IP(xi)

JP(yj)

1 1

Two sequences are independently generated from one another

P(x, y | R) = P(x1)…P(xm) P(y1)…P(yn) = i P(xi) j P(yj)

Model RR

Page 39: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

To compare ALIGNMENT vs. RANDOM hypothesis

Every pair of letters contributes:

MM• (1 – 2) P(xi, yj) when matched

P(xi) P(yj) when gapped

RR• P(xi) P(yj) in random model

Focus on comparison of

P(xi, yj) vs. P(xi) P(yj)

MP(xi, yj)

IP(xi)

JP(yj)

1 – 2

1 –

1 –

IP(xi)

JP(yj)

1 1

Page 40: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

To compare ALIGNMENT vs. RANDOM hypothesis

Idea:We will divide alignment score by the random score, and take logarithms

Let P(xi, yj)

s(xi, yj) = log ––––––––– + log (1 – 2) P(xi) P(yj)

(1 – ) P(xi) d = – log –––––––––––––

(1 – 2) P(xi)

P(xi) e = – log ––––––

P(xi)

=Defn substitution score

=Defn gap initiation penalty

=Defn gap extension penalty

Page 41: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

The meaning of alignment scores

• The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps

VM(i, j) = max { VM(i – 1, j – 1), VI( i – 1, j – 1) – d, Vj( i – 1, j – 1) } + s(xi, yj)

VI(i, j) = max { VM(i – 1, j) – d, VI( i – 1, j) – e }

VJ(i, j) = max { VM(i – 1, j) – d, VI( i – 1, j) – e }

s(.,.) ~how often a pair of letters substitute one another 1/mean length of next gap 1/mean arrival time of next gap

Page 42: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

The meaning of alignment scores

Match/mismatch scores: P(xi, yj)

s(a, b) log –––––––––– (ignore log(1 – 2) for the moment) P(xi) P(yj)

Example:DNA regions between human and mouse genes have average conservation of 80%

1. What is the substitution score for a match?

P(a, a) + P(c, c) + P(g, g) + P(t, t) = 0.8 P(x, x) = 0.2P(a) = P(c) = P(g) = P(t) = 0.25s(x, x) = log [ 0.2 / 0.252 ] = 1.68

• What is the substitution score for a mismatch?

P(a, c) +…+P(t, g) = 0.2 P(x, yx) = 0.2/12 = 0.167s(x, y x) = log[ 0.167 / 0.252 ] = -1.42

• What ratio matches/(matches + mism.) gives score 0?1.67/(1.67+1.42) = 54 % (~halfway between random and conserved

model)

Page 43: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

Substitution Matrices

BLOSUM matrices:

1. Start from BLOCKS database (curated, gap-free alignments)2. Cluster sequences according to > X% identity

3. Calculate Aab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes

4. Estimate

P(a) = (b Aab)/(c≤d Acd); P(a, b) = Aab/(c≤d Acd)

Page 44: Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC

CS262 Lecture 8, Win06, Batzoglou

BLOSUM matrices

BLOSUM 50 BLOSUM 62

(The two are scaled differently)