pairwise alignment prelab.pdf

Upload: maila-escudero

Post on 14-Apr-2018


  • 7/27/2019 Pairwise Alignment prelab.pdf

    1/87

    PAIRWISE ALIGNMENT

    SEQUENCES ARE RELATED

    Darwin: all organisms are related through descent with modification.

    Sequences are related through descent with modification.

    Similar molecules have similar functions in different organisms.

    [Figure: phylogenetic tree based on ribosomal RNA, showing the three domains of life]


    WHY COMPARE SEQUENCES?

    To determine evolutionary

    relationships

    To decide if two proteins

    (or genes) are related

    structurally or functionally

    Protein 1: binds oxygen

    Sequence similarity

    Protein 2: binds oxygen ?


    WHY COMPARE SEQUENCES?

    To identify domains or motifs that are shared between

    proteins

    TERMINOLOGIES

    Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.

    Identity: the extent to which two sequences are invariant.

    TERMINOLOGIES

    Conservation: changes at a specific position of an amino acid (or less commonly, DNA) sequence that preserve the physicochemical properties of the original residue.
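    The three terms above can be made concrete with a small sketch. It counts identical and conservatively substituted positions in an already aligned pair of protein sequences. The residue groups in `CONSERVATIVE_GROUPS` are an illustrative assumption, not a grouping taken from these slides, and dividing by the full alignment length (gaps included) is one convention among several.

```python
# Hypothetical grouping of amino acids by broad physicochemical property.
# Real scoring matrices (PAM, BLOSUM) grade similarity far more finely.
CONSERVATIVE_GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FYW"),    # aromatic
    set("KRH"),    # basic
    set("DE"),     # acidic
    set("STNQ"),   # polar, uncharged
]


def is_conservative(a, b):
    """True if both residues fall in the same (assumed) property group."""
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (identity, similarity) as fractions of the alignment length.

    similarity = identity + conservation, as defined on the slide.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = conserved = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns count toward length only
        if a == b:
            identical += 1
        elif is_conservative(a, b):
            conserved += 1
    length = len(aligned_a)
    return identical / length, (identical + conserved) / length


if __name__ == "__main__":
    ident, simil = identity_and_similarity("KLVDE-GS", "RLIDDAGT")
    print(f"identity = {ident:.1%}, similarity = {simil:.1%}")
```

    In the toy pair, three columns are identical and four more are conservative substitutions, so similarity is noticeably higher than identity, which is the same pattern seen in the "Identities" and "Positives" fields of a BLAST report.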

    CONSERVED RESIDUES

    [Figure: residues conserved among various G protein-coupled receptors, highlighted in green]

    CONSERVATION OF FUNCTION

    Alignments can reveal which parts of the sequences are likely to be important for the function, if the proteins are involved in similar processes.

    In parts of the sequence of a protein which are not very critical for its function, random mutations can easily accumulate.

    In parts of the sequence that are critical for the function of the protein, hardly any mutations will be accepted; nearly all changes in such regions will destroy the function.

    INSERTIONS/DELETIONS AND PROTEIN STRUCTURE

    Why is it that two similar sequences may have large insertions/deletions?

    Some insertions and deletions may not significantly affect the structure of a protein.

    Loop structures: insertions/deletions here are not so significant.

    COMPARING THE PROTEIN KINASE KRAF_HUMAN AND THE UNCHARACTERIZED O22558 FROM ARABIDOPSIS USING BLAST

    546 AA

    Score = 185 bits (464), Expect = 1e-45

    Identities = 107/283 (37%), Positives = 172/283 (59%), Gaps = 15/283 (5%)

    Query: 337 DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA 395

    D + WEI+ +++ + ++ SGS+G +++G + +VA+K LK E + F EV

    Sbjct: 274 DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF 333

    Query: 396 VLRKTRHVNILLFMGYMTKD-NLAIVTQWCEGSSLYKHLHVQETKFQMFQLIDIARQTAQ 454

    ++RK RH N++ F+G T+ L IVT++ S+Y LH Q+ F++ L+ +A A+

    Sbjct: 334 IMRKVRHKNVVQFLGACTRSPTLCIVTEFMARGSIYDFLHKQKCAFKLQTLLKVALDVAK 393

    Query: 455 GMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPTGSVLWMAP 514

    GM YLH NIIHRD+K+ N+ + E VK+ DFG+A V+ SG E TG+ WMAP

    Sbjct: 394 GMSYLHQNNIIHRDLKTANLLMDEHGLVKVADFGVARVQIE-SGVMTAE--TGTYRWMAP 450

    Query: 515 EVIRMQDNNPFSFQSDVYSYGIVLYELMTGELPYSHINNRDQIIFMVGRGYASPDLSKLY 574

    EVI ++ P++ ++DV+SY IVL+EL+TG++PY+ + + +V +G P + K

    Sbjct: 451 EVI---EHKPYNHKADVFSYAIVLWELLTGDIPYAFLTPLQAAVGVVQKG-LRPKIPK-- 504

    Query: 575 KNCPKAMKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 617

    K PK +K L+ C + E+RPLF +I IE+LQ + ++N

    Sbjct: 505 KTHPK-VKGLLERCWHQDPEQRPLFEEI---IEMLQQIMKEVN 543

    SIMILARITY AND HOMOLOGY

    Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.

    Homology is an evolutionary term used to describe relationship via descent from a common ancestor.

    Homologous things are often similar, but not always (e.g. whale flipper and human arm).

    Homology is NEVER expressed as a percentage.

    HOMOLOGY

    Homologous sequences can be divided into three groups:

    Orthologous sequences: sequences that diverged due to a speciation event (e.g. human α-globin and mouse α-globin).

    Paralogous sequences: sequences that diverged due to a gene duplication event (e.g. human α-globin and human β-globin, various versions of both).

    Xenologous sequences: sequences for which the history of one of them involves interspecies transfer since the time of their common ancestor.


    SIMILARITY AND HOMOLOGY

    Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.

    Non-homology CANNOT be inferred from non-similarity because non-similar things can still share a common ancestor.

    Homologous proteins share common structures, but not necessarily common sequence or function.

    Homology is all or nothing. There is no such thing as 50% homology.


    QUESTION 1

    True or False. Homology is synonymous with similarity

    SEARCHING SEQUENCE DATABASES

    When we search a sequence database, we are usually looking for related sequences.

    Unfortunately, the algorithms that we have for searching databases do not search for homology; they search for similarity.

    When similarity is found, we must determine if this similarity is a result of homology or if it comes from another source.

    WHY SEARCH FOR SIMILARITY?

    I have just sequenced something. What is known about the thing I sequenced?

    I have a unique sequence. Is there similarity to another gene that has a known function?

    I found a new protein in a lower organism. Is it similar to a protein from another species?

    I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform RT-PCR or some other experiment.

    SEQUENCE ALIGNMENT: DEFINITION

    The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

    SEQUENCE ALIGNMENT

    Comparing sequences provides information as to which genes or proteins have the same function.

    Sequences are compared by aligning them: sliding them along each other to find the most matches with a few gaps.

    An alignment can be scored: count matches, and optionally penalize mismatches and gaps.


    QUESTION 2

    Whenever possible, it is better to

    A. Compare proteins than to compare genes

    B. Compare genes than to compare proteins

    Discuss as a group and cite points that defend your argument

    IT IS MUCH EASIER TO ALIGN PROTEINS

    4 DNA bases vs. 20 amino acids: less chance similarity

    There are varying degrees of similarity between different amino acids

    Protein databanks are much smaller than DNA databanks

    PAIRWISE ALIGNMENT

    The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.

    Two sequences can always be aligned.

    Sequence alignments have to be scored.

    Often there is more than one solution with the same score.

    PAIRWISE ALIGNMENTS: PURPOSE

    Identification of sequences with significant similarity to (a) sequence(s) in a sequence repository

    Identification of all homologous sequences within the repository

    Identification of domains with sequence similarity

    METHODS OF ALIGNMENT

    By hand: slide sequences on two lines of a word processor

    Dot plot

    Rigorous mathematical approach

        Dynamic programming (slow, optimal)

    Heuristic methods (fast, approximate)

        BLAST and FASTA (use word matching and hash tables)

    ALIGNMENT BY HAND

    GATCGCCTA_TTACGTCCTGGAC

    AGGCATACGTA_GCCCTTTCGC

    A scoring system is still essential to find the best alignment
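    Sliding one sequence along the other by hand can be mimicked with a few lines of code. The sketch below is only an illustration of the idea: it tries every ungapped offset and keeps the one with the most matching positions. The function name `best_ungapped_offset` and the "count matches only" score are assumptions made for this example; no gaps are introduced.

```python
def best_ungapped_offset(s, t):
    """Slide t along s and return (offset, matches) for the best placement.

    offset is the index in s where t[0] sits; it may be negative when t
    overhangs the start of s. The score is simply the number of matches.
    """
    best = (0, -1)
    for offset in range(-(len(t) - 1), len(s)):
        matches = 0
        for j, ch in enumerate(t):
            i = offset + j
            if 0 <= i < len(s) and s[i] == ch:
                matches += 1
        if matches > best[1]:  # keep the first placement with the top score
            best = (offset, matches)
    return best


if __name__ == "__main__":
    offset, matches = best_ungapped_offset("GATTACA", "TTAC")
    print(offset, matches)
```

    Even this toy version needs a scoring rule to decide which placement is "best", which is the point the slide makes.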

    DOTPLOTS

    Not technically an alignment

    Gives a picture of correspondence between pairs of sequences

    A dot represents similarity between segments of the two sequences


    QUESTION 3

    Do diagonals correspond

    to conserved regions?

    A. Yes

    B. No

    QUESTION 3 REDUX

    Take note that the dots are placed at grid points where two sequences have identical residues.

    Do diagonals correspond to conserved regions?

    A. Yes

    B. No

    A LIMITATION TO DOT MATRIX COMPARISON

    Where part of one sequence shares a long stretch of similarity with the other sequence, a diagonal of dots will be evident in the matrix.

    However, when single bases are compared at each position, most of the dots in the matrix will be due to background similarity.

    That is, for any two nucleotides compared between the two sequences, there is a 1 in 4 chance of a match, assuming equal frequencies of A, G, C and T.
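    The 1 in 4 figure follows from a short calculation, written out here as a sketch under the slide's assumption of equal, independent base frequencies. The symbols p_b (frequency of base b), m and n (the two sequence lengths) are notation introduced for this note.

```latex
% Probability that two independently drawn nucleotides are the same base
P(\text{match}) = \sum_{b \in \{A,C,G,T\}} p_b^{2}
                = 4 \cdot \left(\tfrac{1}{4}\right)^{2}
                = \tfrac{1}{4}

% Expected number of background dots in an m x n single-base dot matrix
E[\text{dots}] = \frac{m\,n}{4}
```

    For two sequences of 11 bases each, this gives about 30 background dots out of 121 grid points, enough to obscure a short diagonal.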

    SIMPLE DOT PLOT

    [Figure: dot plot of GGCTTGACCGG (horizontal axis) against GGATTGACCCG (vertical axis), comparing single nucleotides]

    A SOLUTION

    This background noise can be filtered out by comparing groups of l nucleotides, rather than single nucleotides, at each position.

    For example, if we compare dinucleotides (l = 2), the probability of two dinucleotides chosen at random from each sequence matching is 1/16, rather than 1/4.

    Therefore, the number of background matches will be lower:

    A FILTERED DOT PLOT

    [Figure: dot plot of GGCTTGACCGG (horizontal axis) against GGATTGACCCG (vertical axis), filtered by comparing groups of nucleotides]

    THE DOT MATRIX ALGORITHM

    The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l.

    For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t.

    Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.
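    The procedure described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `dot_matrix` and `render` are made up for this example, windows are anchored at their first base rather than centered (a simplification that only shifts the coordinates), and a dot requires the two windows to match exactly.

```python
def dot_matrix(s, t, l=1):
    """Return the set of (i, j) where the l-mers s[i:i+l] and t[j:j+l] match."""
    dots = set()
    for i in range(len(s) - l + 1):
        for j in range(len(t) - l + 1):
            if s[i:i + l] == t[j:j + l]:
                dots.add((i, j))
    return dots


def render(s, t, dots):
    """Draw the matrix as text: s along the top, t down the side."""
    lines = ["  " + " ".join(s)]
    for j, ch in enumerate(t):
        row = ["*" if (i, j) in dots else "." for i in range(len(s))]
        lines.append(ch + " " + " ".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    s = "GGCTTGACCGG"  # horizontal sequence from the slides
    t = "GGATTGACCCG"  # vertical sequence from the slides
    for l in (1, 3):
        dots = dot_matrix(s, t, l)
        print(f"window l = {l}: {len(dots)} dots")
        print(render(s, t, dots))
        print()
```

    With the two 11-base sequences from the slides, single-base comparison yields 35 dots, while a window of 3 leaves only 5, four of which lie on the main diagonal through TTGACC.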

    l = 3

    [Figure: dot plot of GGCTTGACCGG (horizontal axis) against GGATTGACCCG (vertical axis), window size l = 3]


    DOT MATRIX SEQUENCE

    COMPARISON EXAMPLES

    COMPARING A PROTEIN WITH ITSELF

    Proteins can be compared with themselves to show internal duplications or repeating sequences.

    A self-matrix produces a central diagonal line through the origin, indicating an exact match between the x and y axes.

    The parallel diagonals that appear off the central line are indicative of repeated sequence elements in different locations of the same protein.

    HAPTOGLOBIN

    Haptoglobin is a protein that is secreted into the blood by the liver. This protein binds free hemoglobin.

    The concentration of "free" hemoglobin (that is, outside red blood cells) in plasma (the fluid portion of blood) is ordinarily very low.

    However, free hemoglobin is released when red blood cells hemolyze for any reason.

    After haptoglobin binds hemoglobin, it is taken up by the liver.

    The liver recycles the iron, heme, and amino acids contained in the hemoglobin protein.

    OUR COMPARISON

    Files used

    1006264A Haptoglobin H2

    DNA sequencing shows that the intragenic duplication within the human haptoglobin Hp2 allele was formed by a non-homologous, probably random, crossing-over within different introns of two Hp1 genes.

    A repeated sequence (starting with ADDGCP...) is observed beginning at positions 30-90 and 90-150, probably due to a duplication event in one of these locations.

    [Figure: dot plot. Window: 30, Stringency: 3, BLOSUM 62 matrix]

    SEARCHING FOR REPEATS IN DOTPLOTS

    One of the strengths of dot-matrix searches is that they make repeats easy to detect by comparing a sequence against itself.

    In self comparisons, direct repeats appear as diagonals parallel to the main line of identity.
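    A self-comparison can be sketched as follows. The code compares every window of a sequence with every later window and tallies the matches by their distance from the main diagonal; a distance with several hits corresponds to a parallel diagonal, that is, a direct repeat. The function name `repeat_diagonals`, the exact-match criterion and the toy sequence are assumptions for illustration only.

```python
from collections import Counter


def repeat_diagonals(seq, l):
    """Count exact l-mer matches of seq against itself, grouped by offset.

    Only pairs with i < j are inspected, so the main diagonal (offset 0)
    and the mirror-image lower triangle are skipped.
    """
    offsets = Counter()
    n_windows = len(seq) - l + 1
    for i in range(n_windows):
        for j in range(i + 1, n_windows):
            if seq[i:i + l] == seq[j:j + l]:
                offsets[j - i] += 1
    return dict(offsets)


if __name__ == "__main__":
    # Toy sequence: the block ACGTAC occurs twice, 8 positions apart.
    print(repeat_diagonals("ACGTACGGACGTAC", 4))
```

    Here all off-diagonal hits share the same offset, which is what a single duplicated block looks like in a self dot plot.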

    COMPARISON OF TWO SIMILAR SEQUENCES

    Files Used:

    P03035 Repressor protein from E. coli phage P22

    RPBPL Repressor protein from E. coli phage Lambda

    Lambda phages infect E. coli. They can be lytic and destroy the host cell, making hundreds of progeny.

    They can also be lysogenic, and live quietly within the DNA of the bacteria.

    A gene makes the repressor protein that prevents the phage from going destructively lytic.

    Phage P22 is a related phage that also makes a repressor.

    Both proteins form a dimer and bind DNA to prevent lysis.


    LAMBDA REPRESSOR/OPERATOR COMPLEX (1LMB)


    DOT MATRIX SEQUENCE COMPARISON

    A row of dots represents a region of sequence similarity.

    Background matching also appears as scattered dots.

    [Figure: dot plot. Window: 10, Stringency: 1, BLOSUM 62 matrix]

    [Figure: dot plot. Window: 10, Stringency: 3, BLOSUM 62 matrix]

    [Figure: dot plot. Window: 30, Stringency: 1, BLOSUM 62 matrix]

    [Figure: dot plot. Window: 30, Stringency: 3, BLOSUM 62 matrix]


    QUESTION 4

    Which of the following combinations of parameters will

    produce the least background noise?

    A. Low window, low stringency

    B. Low window, high stringency

    C. High window, low stringency

    D. High window, high stringency

    DISADVANTAGES TO DOT PLOTS

    While dot-matrix searches provide a great deal of information in a visual fashion, they can only be considered semi-quantitative, and therefore do not lend themselves to statistical analysis.

    Also, dot-matrix searches do not provide a precise alignment between two sequences.

    RIGOROUS ALGORITHMS: DYNAMIC PROGRAMMING

    ALGORITHM

    An algorithm is a complete, unambiguous procedure for solving a specified problem in a finite number of steps.

    Algorithms leave nothing undefined and require no intuition to achieve their end.


    FIVE FEATURES OF AN ALGORITHM:

    An algorithm must stop after a finite number of steps.

    All steps of the algorithm must be precisely defined.

    Input to the algorithm must be specified.

    Output of the algorithm must be specified. There must be at

    least one output.

    An algorithm must be effective - i.e. its operations must be

    basic and doable.

    DYNAMIC PROGRAMMING

    Algorithmic technique for optimization problems that have two properties:

    Optimal substructure: Optimal solution can be computed from optimal solutions to subproblems

    Overlapping subproblems: Subproblems overlap such that the total number of distinct subproblems to be solved is relatively small
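    Both properties can be seen in a small stand-in problem. The sketch below computes the length of the longest common subsequence of two strings, a close relative of alignment chosen here only because it is short; it is not one of the alignment algorithms in these slides. The helper name `lcs_length` is hypothetical. The cache makes the overlapping subproblems visible: each (i, j) pair is solved once and then reused.

```python
from functools import lru_cache


def lcs_length(s, t):
    """Return (LCS length, number of distinct subproblems actually solved)."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Optimal substructure: the answer for the suffixes s[i:], t[j:]
        # is built from answers for shorter suffixes.
        if i == len(s) or j == len(t):
            return 0
        if s[i] == t[j]:
            return 1 + solve(i + 1, j + 1)
        return max(solve(i + 1, j), solve(i, j + 1))

    result = solve(0, 0)
    return result, solve.cache_info().currsize


if __name__ == "__main__":
    length, n_subproblems = lcs_length("AGGTAB", "GXTXAYB")
    print(length, n_subproblems)
```

    Without the cache the recursion revisits the same suffix pairs many times; with it, the work is bounded by (m + 1)(n + 1) distinct subproblems for strings of lengths m and n.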


    RIGOROUS ALGORITHMS

    Needleman-Wunsch (Global)

    Smith-Waterman (Local)

    GLOBAL VS. LOCAL ALIGNMENT

    Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.

    Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.

    GLOBAL ALIGNMENT

    The Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle).

    Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.

    Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

    Global methods are useful when you want to force two sequences to align over their entire length.
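    A compact sketch of global alignment by dynamic programming is given below. It follows the standard Needleman-Wunsch recurrence with a linear gap penalty; the default scores (match = 1, mismatch = -1, gap = -2) are arbitrary values chosen for illustration, and the function name is made up for this example. Production tools such as needle use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_s, aligned_t)."""
    m, n = len(s), len(t)
    # F[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,      # gap in t
                          F[i][j - 1] + gap)      # gap in s

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[m][n], "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(score)
    print(top)
    print(bottom)
```

    Because the traceback starts in the last cell and only stops at the first, every residue of both sequences ends up in the alignment, which is exactly the "forced" end-to-end behaviour described on the slide.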

    LOCAL ALIGNMENT

    Identify the most similar sub-region shared between two sequences

    There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion.

    Smith-Waterman (water)
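    For comparison, here is a minimal sketch of local alignment using the standard Smith-Waterman recurrence. The two differences from the global case are that cell scores are never allowed to drop below zero and that the traceback starts at the highest-scoring cell and stops at the first zero. The scores (match = 1, mismatch = -1, linear gap = -2) and the function name are illustrative assumptions.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_s_part, aligned_t_part)."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                      # start a fresh local alignment
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

    In the toy input only the shared core ACG is reported; the unrelated flanks are left out rather than being forced into the alignment.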

    LOCAL ALIGNMENTS

    It may seem that one should always use local alignments.

    However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment.

    So global alignment is useful in some cases.

    The popular programs BLAST and FASTA for searching sequence databases produce local alignments.

    GAPS AND INSERTIONS

    In an alignment, much better correspondence can be obtained between two sequences if a gap can be introduced in one sequence.

    Alternatively, an insertion could be allowed in the other sequence.

    Biologically, this corresponds to a mutation event that eliminates a part of a gene, or introduces new DNA into a gene.


    GAPS

    Positions at which a letter is paired with a null are called

    gaps.

    Gap scores are typically negative.


    QUESTION 5

    Which is more significant? The presence of a gap or the

    length of a gap?

    A. The presence of a gap

    B. The length of a gap


    GAPS

    Since a single mutational event may cause the insertion or

    deletion of more than one residue, the presence of a gap is

    considered more significant than the length of the gap.

    OPTIMAL ALIGNMENT

    The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments.

    There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on.

    For example, what penalty should gaps carry?

    All sequence alignment procedures make some such assumptions.

    PARAMETERS OF SEQUENCE ALIGNMENT

    Scoring systems:

        Each symbol pairing is assigned a numerical value, based on a symbol comparison table.

    Gap penalties:

        Opening: The cost to introduce a gap

        Extension: The cost to elongate a gap

    DNA SCORING SYSTEMS: VERY SIMPLE

    Match: 1

    Mismatch: 0

    Score = 5


    PROTEIN SCORING SYSTEMS

    Amino acids have different biochemical and physical

    properties that influence their relative replaceability in

    evolution.

    PROTEIN SCORING SYSTEMS

    Scoring matrices reflect:

        Number of mutations to convert one to another

        Chemical similarity

        Observed mutation frequencies

        The probability of occurrence of each amino acid

    Widely used scoring matrices:

        PAM

        BLOSUM
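    The last two items in the list above (observed substitution frequencies and the background frequency of each amino acid) are usually combined into a log-odds score. The general form is sketched below; the symbols q_ij (observed frequency of the aligned pair i, j), p_i and p_j (background frequencies) and the scaling constant lambda are standard notation introduced here, not taken from the slides.

```latex
s_{ij} \;=\; \frac{1}{\lambda}\,\log\!\left(\frac{q_{ij}}{p_i\,p_j}\right)
```

    A pair that is seen in trusted alignments more often than chance predicts (q_ij > p_i p_j) gets a positive score; a pair seen less often than chance gets a negative one. PAM and BLOSUM differ mainly in how the observed frequencies are collected.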

    PAM MATRICES

    Point Accepted Mutation

    Family of matrices: PAM 80, PAM 120, PAM 250

    The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based.

    PAM 250 = 250 mutations per 100 residues

    Greater numbers denote greater evolutionary distance

    PAM MATRICES

    Derived from global alignments of protein families. Family members share at least 85% identity

    Construction of phylogenetic tree and ancestral sequences of each protein family

    Computation of number of replacements for each pair of amino acids

    [Figure: PAM 250 matrix]

    PAM: LIMITATIONS

    Based on only one original dataset

    Based mainly on small globular proteins, so the matrix is biased

    Examines proteins with few differences (85% identity)

    BLOSUM MATRICES

    BLOcks SUbstitution Matrix

    Derived from alignments of domains of distantly related proteins

    Different BLOSUMn matrices are calculated independently from blocks (ungapped local alignments)

    BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity

    BLOSUM 62 represents closer sequences than BLOSUM 45

    BLOSUM MATRICES

    Built from BLOCKS database: from the most conserved regions of aligned sequences

    ~2000 blocks from 500 families have been used

    BLOSUM 62 is the most popular matrix and is the default matrix for the standard BLAST program.

    BLOSUM 50 MATRIX

    A   5
    R  -2   7
    N  -1  -1   7
    D  -2  -2   2   8
    C  -1  -4  -2  -4  13
    Q  -1   1   0   0  -3   7
    E  -1   0   0   2  -3   2   6
    G   0  -3   0  -1  -3  -2  -3   8
    H  -2   0   1  -1  -3   1   0  -2  10
    I  -1  -4  -3  -4  -2  -3  -4  -4  -4   5
    L  -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5
    K  -1   3   0  -1  -3   2   1  -2   0  -3  -3   6
    M  -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7
    F  -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8
    P  -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10
    S   1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5
    T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5
    W  -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15
    Y  -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8
    V   0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

    Positive scores on diagonal (identities)

    Similar residues get higher (positive) scores

    Dissimilar residues get smaller (negative) scores
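    To show how such a matrix is used, the sketch below scores an ungapped alignment by looking up each aligned pair. Only a handful of entries are included, copied from the BLOSUM 50 table on this slide; the dictionary name `BLOSUM50_SUBSET` and the helper functions are made up for this example, and a real program would load the full matrix.

```python
# A few entries copied from the BLOSUM 50 table above.
# Keys are stored with the two residues in alphabetical order.
BLOSUM50_SUBSET = {
    ("A", "A"): 5,
    ("C", "C"): 13,
    ("W", "W"): 15,
    ("I", "V"): 4,   # similar hydrophobic residues: positive
    ("K", "R"): 3,   # both basic: positive
    ("D", "E"): 2,   # both acidic: positive
    ("F", "Y"): 4,   # both aromatic: positive
    ("G", "W"): -3,  # dissimilar residues: negative
}


def pair_score(a, b, table=BLOSUM50_SUBSET):
    """Look up a symmetric substitution score for residues a and b."""
    key = tuple(sorted((a, b)))
    return table[key]  # raises KeyError for pairs outside the subset


def ungapped_score(s, t, table=BLOSUM50_SUBSET):
    """Sum the pair scores over two equal-length, gap-free sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, table) for a, b in zip(s, t))


if __name__ == "__main__":
    print(ungapped_score("AIWK", "AVWR"))
```

    Note that the conservative substitutions I/V and K/R still add to the total, whereas a 1/0 match/mismatch scheme would score them the same as any other mismatch.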


    QUESTION 6

    Which is the appropriate matrix to use when comparing

    highly divergent sequences?

    A. BLOSUM with a lower n

    B. PAM with lower n

    C. BLOSUM with a higher n

    D. Both B and C

    PAM VS. BLOSUM

    PAM 100 = BLOSUM 90

    PAM 120 = BLOSUM 80

    PAM 160 = BLOSUM 60

    PAM 200 = BLOSUM 52

    PAM 250 = BLOSUM 45

    (Moving down the list: more distant sequences)

    PAM 120 for general use          BLOSUM 62 for general use

    PAM 160 for close relations      BLOSUM 80 for close relations

    PAM 250 for distant relations    BLOSUM 45 for distant relations

    TIPS ON CHOOSING A SCORING MATRIX

    Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).

    When comparing closely related proteins, one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.

    For database searching, the commonly used matrix is BLOSUM62.

    BLOSUM:

        Built from local alignments

        Built from vast amount of data

        Based on groups of related sequences counted as one

        Better for finding local alignments

        Lower BLOSUM series means more divergence

    PAM:

        Built from global alignments

        Built from small amount of data

        Based on minimum replacement or maximum parsimony

        Better for finding global alignments and remote homologs

        Higher PAM series means more divergence


    SCORING INSERTIONS AND DELETIONS

    The creation of a gap is penalized with a negative score

    value

    WHY GAP PENALTIES?

    The optimal alignment of two similar sequences is usually that which

        Maximizes the number of matches and

        Minimizes the number of gaps

    Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences.

    Penalizing gaps forces alignments to have relatively few gaps.

    BALANCING GAPS WITH MISMATCHES

    Gaps must get a steep penalty, or else you'll end up with nonsense alignments.

    In real sequences, multi-base (or amino acid) gaps are quite common.

    Affine gap penalties give a big penalty for each new gap, but a much smaller gap extension penalty.
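    The effect of an affine penalty can be illustrated with a short sketch. The values gap_open = -10 and gap_extend = -1 are arbitrary choices for this example, as are the function names. The convention assumed here is that the first gapped position costs gap_open and every further position in the same gap costs gap_extend; some programs instead charge the extension cost for the first position as well.

```python
def affine_gap_cost(length, gap_open=-10, gap_extend=-1):
    """Penalty for one contiguous gap of the given length."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(aligned_a, aligned_b, gap_open=-10, gap_extend=-1):
    """Sum the affine penalties over every run of '-' in both aligned rows."""
    total = 0
    for row in (aligned_a, aligned_b):
        run = 0
        for ch in row + "X":  # sentinel closes a gap that ends the row
            if ch == "-":
                run += 1
            else:
                total += affine_gap_cost(run, gap_open, gap_extend)
                run = 0
    return total


if __name__ == "__main__":
    # One gap of three positions versus three separate single-position gaps.
    print(total_gap_penalty("AC---GT", "ACGTAGT"))
    print(total_gap_penalty("A-C-G-T", "AXCXGXT"))
```

    The single three-residue gap is penalized far less than three scattered gaps of one residue each, matching the idea that one mutational event can insert or delete several residues at once.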
