sequence alignment and searching

Upload: mudit-misra

Post on 07-Apr-2018

  • 8/6/2019 Sequence Alignment and Searching

    1/54

    Sequence Alignment and Searching

    By: Sarika Goyal

    What is the purpose of sequence alignment?

    Identification of homology and homologous sites in related sequences

    Inference of the evolutionary history that led to the differences in observed sequences

    The Problem

    Biological problem

    Finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid)

    The Problem

    Computational problem

    Finding a way to perform inexact or approximate matching of subsequences within strings of characters

    Sequence comparison and alignment is a central problem in computational biology: high sequence similarity usually implies structural or functional similarity

    Substring and subsequence

    Example:

    xyz is a subsequence within axayaz, but NOT a substring

    Characters in a substring must be contiguous
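
The distinction can be checked mechanically. The following is a minimal Python sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(pattern, text):
    """True if the characters of pattern occur contiguously in text."""
    return pattern in text


def is_subsequence(pattern, text):
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator until it finds ch (or runs out),
    # so every pattern character must appear after the previous one.
    return all(ch in remaining for ch in pattern)


print(is_subsequence("xyz", "axayaz"))  # True
print(is_substring("xyz", "axayaz"))    # False
```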

    Types of comparisons and alignment methods

    According to sequence coverage: LOCAL or GLOBAL

    According to number of sequences: TWO SEQUENCES (pairwise alignment) or THREE OR MORE SEQUENCES (multiple alignment)

    TWO SEQUENCES (Pairwise alignment)

      LOCAL: Database search against query sequences; BLAST algorithm

      GLOBAL: Comparison of two sequences; first step in multiple sequence alignment

    THREE OR MORE SEQUENCES (Multiple alignment)

      LOCAL: Defining consensus sequences, protein structural motifs and domains, regulatory elements in DNA etc.

      GLOBAL: Determination of conserved residues and domains; introductory step in molecular phylogenetic analysis

    Introduction to sequence alignment

    Given two text strings:

    First string = a b c d e

    Second string = a c d e f

    a reasonable alignment would be

    a b c d e -

    a - c d e f

    We must choose criteria so that an algorithm can choose the best alignment.

    For the sequences gctgaacg and ctataatc:

    An uninformative alignment:

    -------gctgaacg

    ctataatc-------

    An alignment without gaps

    gctgaacg

    ctataatc

    An alignment with gaps

    gctga-a--cg

    --ct-ataatc

    And another

    gctg-aa-cg

    -ctataatc-
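
One possible criterion is a column-by-column score. The sketch below ranks the four candidate alignments of gctgaacg and ctataatc under an assumed scheme of +1 per match, -1 per mismatch and -1 per gap column; these weights and the name `score_alignment` are illustrative choices, not values taken from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum a per-column score over two aligned rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # a residue aligned with a gap
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # two different residues
    return score


candidates = [
    ("-------gctgaacg", "ctataatc-------"),  # the uninformative alignment
    ("gctgaacg", "ctataatc"),                # no gaps
    ("gctga-a--cg", "--ct-ataatc"),          # with gaps
    ("gctg-aa-cg", "-ctataatc-"),            # "and another"
]
for top, bottom in candidates:
    print(top, bottom, score_alignment(top, bottom))
```

Under these assumed weights the last alignment scores highest; a different choice of weights can change the ranking, which is why the scoring criteria matter.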

    The dotplot (1)

    A simple picture that gives an overview of the similarities between two sequences

    Dotplot showing identities between sequences (DOROTHYHODGKIN) and (DOROTHYCROWFOOTHODGKIN):

    Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown in boldface, are DOROTHY and HODGKIN. Shorter matching regions, such as the OTH of dorOTHy and RO of doROthy and cROwfoot, are noise.
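
A dotplot is simple to compute: cell (i, j) is filled when residue i of one sequence equals residue j of the other. The following is a minimal text-mode sketch; the names `dotplot` and `render` are illustrative, and no bold/non-bold distinction is attempted.

```python
def dotplot(rows, cols):
    """Boolean matrix: entry [i][j] is True when rows[i] == cols[j]."""
    return [[r == c for c in cols] for r in rows]


def render(rows, cols):
    """Draw the matrix as text, '*' for an identity and '.' otherwise."""
    lines = ["  " + cols]
    for r, row in zip(rows, dotplot(rows, cols)):
        lines.append(r + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN"))
```

In the printed grid, DOROTHY and HODGKIN show up as two long diagonal runs of `*`, offset from each other by the length of the inserted CROWFOOT.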

    The dotplot (2)

    Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal.

    The dotplot (3)

    Dotplot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal.

    Dotplots and sequence alignments

    Any path through the dotplot from upper left to lower right passes through a succession of cells, each of which picks out a pair of positions, one from the row and one from the column, that correspond in the alignment; or that indicates a gap in one of the sequences. The path need not pass through filled-in points only. However, the more filled-in points on the diagonal segments of the path, the more matching residues in the alignment.

    Corresponding alignment:

    DOROTHY--------HODGKIN

    DOROTHYCROWFOOTHODGKIN

    Measures of sequence similarity

    Two measures of distance between two character strings:

    The Hamming distance, defined between two strings of equal length, is the number of positions with mismatching characters.

    The Levenshtein, or edit, distance between two strings of not necessarily equal length is the minimal number of 'edit operations' required to change one string into the other, where an edit operation is a deletion, insertion or alteration of a single character in either sequence.

    agtc

    cgta        Hamming distance = 2

    ag-tcc

    cgctca      Levenshtein distance = 3
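
Both distances can be computed directly. The sketch below uses the standard dynamic-programming recurrence for the Levenshtein distance with unit cost for each insertion, deletion and alteration; the function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimal number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # alter ca into cb, free if equal
            ))
        previous = current
    return previous[-1]


print(hamming("agtc", "cgta"))         # 2
print(levenshtein("agtcc", "cgctca"))  # 3
```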

    The Edit Distance between two strings

    Definition: The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.

    Example: AATT and AATG

    Distance = 1 (edit operation of substitution)

    String alignment

    An edit transcript is a way to represent a particular transformation of one string into another

      Emphasizes point mutations in the model

    An alignment displays a relationship between two strings

      Global alignment means that, for each string, the entire string is involved in the alignment

    Example:

    A A G C A

    A A _ C _
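
An alignment and an edit transcript carry the same information. The sketch below derives a transcript from two aligned rows, assuming the common letter convention M (match), R (replace), I (insert) and D (delete) for turning the top string into the bottom one; the slides do not specify a notation, so these letters and the name `edit_transcript` are assumptions.

```python
def edit_transcript(top, bottom, gap_chars="-_"):
    """Return one operation letter per alignment column (top -> bottom)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    ops = []
    for a, b in zip(top, bottom):
        if a in gap_chars:
            ops.append("I")   # character present only in the bottom string
        elif b in gap_chars:
            ops.append("D")   # character present only in the top string
        elif a == b:
            ops.append("M")   # match, not counted as an edit
        else:
            ops.append("R")   # replacement (substitution)
    return "".join(ops)


print(edit_transcript("AAGCA", "AA_C_"))  # MMDMD
```

Counting the letters other than M gives the number of edit operations used by that particular alignment.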

    Sequence divergence

    Sequences may have diverged from a common ancestor through mutations:

      Substitution (AAGC -> AAGT)

      Insertion (AAG -> AAGT)

      Deletion (AAGC -> AAG)

    Scoring schemes

    In molecular biology, certain changes are more likely to occur than others

      Amino acid substitutions tend to be conservative

      In nucleotide sequences, transitions are more frequent than transversions

    -> We want to give different weights to different edit operations

    Example: a DNA substitution matrix:

         a   g   c   t
    a   20  10   5   5
    g   10  20   5   5
    c    5   5  20  10
    t    5   5  10  20
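
To show how such a matrix is used, the sketch below scores an ungapped alignment by looking up each aligned pair in the example matrix. Only the matrix values come from the slide; the names `SCORES` and `ungapped_score` are illustrative, and gap penalties are left out.

```python
BASES = "agct"
ROWS = [
    [20, 10, 5, 5],   # a
    [10, 20, 5, 5],   # g
    [5, 5, 20, 10],   # c
    [5, 5, 10, 20],   # t
]
# Lookup table keyed by (row base, column base).
SCORES = {(x, y): ROWS[i][j]
          for i, x in enumerate(BASES)
          for j, y in enumerate(BASES)}


def ungapped_score(s, t):
    """Sum the substitution scores of two equal-length, gap-free sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(SCORES[pair] for pair in zip(s, t))


print(ungapped_score("agct", "agct"))  # 80: four identities
print(ungapped_score("agct", "gatc"))  # 40: four transitions
print(ungapped_score("agct", "ctag"))  # 20: four transversions
```

Transitions (a/g, c/t) score 10 and transversions score 5, so the matrix encodes the observation that transitions are the more frequent change.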

    BLAST: the workhorse of bioinformatics

    http://www.ncbi.nlm.nih.gov/BLAST

    BLAST = Basic Local Alignment Search Tool

    Use it when you have a nucleotide or protein sequence that you want to search against sequence databases

      to determine what the sequence is

      to find related sequences (homologs)

    Different BLAST programs

    DNA can be translated into six potential proteins

    5' CAT CAA

    5' ATC AAC

    5' TCA ACT

    5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'

    3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'

    5' GTG GGT

    5' TGG GTA

    5' GGG TAG
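
The six reading frames can be enumerated mechanically: three offsets on the given strand and three on its reverse complement. The following is a minimal sketch that stops at splitting into codons and does not translate them; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse, to read the other strand 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return six codon lists: offsets 0-2 on each of the two strands."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for codons in six_frames(dna):
    print("5'", " ".join(codons[:2]))
```

The first two codons printed for each frame match the six codon pairs shown on the slide.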

    Four components to a BLAST search

    (1) Choose the sequence (query)

    (2) Select the BLAST program

    (3) Choose the database to search

    (4) Choose optional parameters

    Then click BLAST

    Step 1: Choose your sequence

    Sequence can be input in FASTA format or as an accession number

    Example of the FASTA format for a BLAST query
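
The slide's own example did not survive extraction. As a stand-in, the sketch below parses a FASTA record, in which a header line beginning with `>` is followed by one or more lines of sequence. The record name `example_query` is hypothetical, and the sequence is the 50-base DNA strand from the six-frame slide, wrapped over two lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>example_query hypothetical record
CATCAACTACAACTCCAAAGACACC
CTTACACATCAACAAACCTACCCAC
"""

for header, sequence in parse_fasta(EXAMPLE):
    print(header, len(sequence))
```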

    Step 2: Choose the BLAST program

    Program    Input     Database   Comparisons
    blastn     DNA       DNA        1
    blastp     protein   protein    1
    blastx     DNA       protein    6
    tblastn    protein   DNA        6
    tblastx    DNA       DNA        36
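
The counts 1, 1, 6, 6 and 36 shown with the programs are consistent with the number of reading-frame combinations each program compares: a translated DNA sequence contributes six frames, an untranslated sequence one. The sketch below encodes that reading; the dictionary layout and the name `frame_comparisons` are illustrative.

```python
# For each program: sequence types, and which side is translated in six frames.
PROGRAMS = {
    "blastn":  {"query": "DNA",     "database": "DNA",
                "translate_query": False, "translate_db": False},
    "blastp":  {"query": "protein", "database": "protein",
                "translate_query": False, "translate_db": False},
    "blastx":  {"query": "DNA",     "database": "protein",
                "translate_query": True,  "translate_db": False},
    "tblastn": {"query": "protein", "database": "DNA",
                "translate_query": False, "translate_db": True},
    "tblastx": {"query": "DNA",     "database": "DNA",
                "translate_query": True,  "translate_db": True},
}


def frame_comparisons(program):
    """Frames on the query side times frames on the database side."""
    p = PROGRAMS[program]
    query_frames = 6 if p["translate_query"] else 1
    db_frames = 6 if p["translate_db"] else 1
    return query_frames * db_frames


for name in PROGRAMS:
    print(name, frame_comparisons(name))
```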

    Step 3: Choose the database

    nr = non-redundant (most general database)

    dbEST = database of expressed sequence tags

    dbSTS = database of sequence tagged sites

    pdb = sequences derived from the 3D structures of proteins

    Patents = nucleotide sequences derived from the patent division of GenBank.

    Step 4a: Select optional search parameters

    You can...

    choose the organism to search

    turn filtering on/off

    change the substitution matrix

    change the expect (E) value

    change the word size

    change the output format

    [Screenshot annotations: CD search, Entrez, filter, scoring matrix, word size, expect, organism]

    filtering

    Step 4b: optional formatting parameters

    Alignment view

    Descriptions

    Alignments

    BLAST format options

    Alignment Views

    Pairwise

    Standard BLAST alignment in pairs of query sequence and database match.

    Query: 251 tgaccggtaacgaccgcaccctggacgtcatggcgctggatgtggtgtggacggcgga 3

    |||||||||| ||||||| |||||||| |||||| ||||||||||||||||||||

    Sbjct: 248575 tgaccggtaaagaccgcagcttggacgtgatggcgatggatgtggtgtggacagcgga 248634

    [Screenshot annotations: database, query, program]

    High scores, low E values

    Algorithm for BLAST

    How a BLAST search works

    The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T.

    Altschul et al. (1990)

    BLAST algorithm principles

    (Basic Local Alignment Search Tool)

    Main idea:

    1. Construct a dictionary of all the words in the query

    2. Initiate a local alignment for each word match between query and DB

    [Figure: query scanned against DB]

    BLAST Original Version

    Dictionary:

      All words of length k

      Alignment initiated between words of alignment score >= T

    Alignment:

      Ungapped extensions until the score falls below a statistical threshold

    Output:

      All local alignments with score > statistical threshold

    Background

    The BLAST Algorithm

    3 Stages

    Preprocessing of the query

    Generation of hits

    Extension of the hits

    Step 1: Preprocessing of the query

    Speed gained by minimizing search space

    Alignments require word hits (word size = W)

    [Figure: word hits marked on a grid of Sequence 1 against Sequence 2]

    Step 1: Preprocessing of the query (Contd.)

    Threshold score = T

    Neighborhood words of RGD

    W and T modulate speed and sensitivity

    RGD 17

    KGD 14

    QGD 13

    RGE 13

    EGD 12

    HGD 12

    NGD 12

    RGN 12

    AGD 11

    MGD 11

    RAD 11

    RGQ 11

    RGS 11

    RND 11

    RSD 11

    SGD 11

    TGD 11

    T=12
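
The effect of the threshold can be seen by filtering the listed words. The sketch below takes the word scores exactly as printed on the slide and keeps those with a score of at least T; it does not recompute the scores from a substitution matrix, and the names `WORD_SCORES` and `neighborhood` are illustrative.

```python
# Scores of candidate words against the query word RGD, as listed on the slide.
WORD_SCORES = {
    "RGD": 17, "KGD": 14, "QGD": 13, "RGE": 13,
    "EGD": 12, "HGD": 12, "NGD": 12, "RGN": 12,
    "AGD": 11, "MGD": 11, "RAD": 11, "RGQ": 11, "RGS": 11,
    "RND": 11, "RSD": 11, "SGD": 11, "TGD": 11,
}


def neighborhood(scores, threshold):
    """Words whose score against the query word is at least the threshold."""
    return [word for word, score in scores.items() if score >= threshold]


for t in (11, 12, 13):
    print(t, len(neighborhood(WORD_SCORES, t)))
```

Lowering T from 12 to 11 grows the word list from 8 to 17 entries: more seeds, hence higher sensitivity but a slower search.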

    Step 2: Generation of hits

    A hit is made with one or several successive pairs of similar words.

    All the possible hits between the query sequence and sequences from databases are calculated in this way.
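
Hit generation can be sketched as an index lookup. The simplified version below indexes every word of length w in the query and scans a database sequence for exact matches only; real BLAST also accepts neighborhood words scoring at least T. The database string here is hypothetical, constructed so that it contains the word GGTC used in the later worked example.

```python
from collections import defaultdict


def index_words(query, w):
    """Map each length-w word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_hits(query, db, w):
    """Return (query_pos, db_pos) for every exact word match, in db order."""
    index = index_words(query, w)
    hits = []
    for j in range(len(db) - w + 1):
        for i in index.get(db[j:j + w], []):
            hits.append((i, j))
    return hits


query = "ACGAAGTAAGGTCCAGT"
db = "CCTTCGTTAGGTCCTTC"   # hypothetical database sequence
print(find_hits(query, db, 4))
```

Hits that fall on the same diagonal, as the three found here do, belong to the same candidate alignment.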

    Step 3: Extension

    Each hit is extended in both directions.

    Extension is terminated when the score drops more than X below the maximum score found so far.

    [Figure: query scanned against DB]

    The BLAST Algorithm: Summary

    BLAST Original Version

    Example:

    k = 4, T = 4

    The matching word GGTC initiates an alignment

    Extension to the left and right with no gaps

    Output:

    GTAAGGTCC

    GTTAGGTCC

    [Figure: dot matrix; horizontal sequence A C G A A G T A A G G T C C A G T; vertical sequence, as extracted, C C C T T C C T G G A T T G C G A]
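
The ungapped extension step can be sketched as follows, assuming +1 per match, -1 per mismatch and a drop-off of 2 below the best score seen; these values are illustrative and not taken from the slides. The database string is hypothetical: it was built to contain GTTAGGTCC, because the vertical sequence extracted from the figure does not contain that fragment.

```python
def best_extension(pairs, match, mismatch, x_drop):
    """Walk outward over character pairs; return (best gain, length at best)."""
    gain = best_gain = best_len = 0
    for length, (a, b) in enumerate(pairs, start=1):
        gain += match if a == b else mismatch
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break   # fell more than x_drop below the best score seen so far
    return best_gain, best_len


def extend_seed(query, db, q_start, d_start, k,
                match=1, mismatch=-1, x_drop=2):
    """Extend a length-k seed without gaps; return both segments and the score."""
    seed_score = sum(match if query[q_start + i] == db[d_start + i] else mismatch
                     for i in range(k))
    right_gain, right_len = best_extension(
        zip(query[q_start + k:], db[d_start + k:]), match, mismatch, x_drop)
    left_gain, left_len = best_extension(
        zip(reversed(query[:q_start]), reversed(db[:d_start])),
        match, mismatch, x_drop)
    q_lo, d_lo = q_start - left_len, d_start - left_len
    length = left_len + k + right_len
    return (query[q_lo:q_lo + length], db[d_lo:d_lo + length],
            seed_score + left_gain + right_gain)


query = "ACGAAGTAAGGTCCAGT"
db = "CCTTCGTTAGGTCCTTC"   # hypothetical database sequence
# The seed GGTC starts at position 9 in both strings.
print(extend_seed(query, db, 9, 9, 4))
```

With these assumptions the seed grows into the pair GTAAGGTCC / GTTAGGTCC, the output given on the slide, with a score of 7 (eight matches, one mismatch).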

    BLAST results: List of hits

    E value: number of unrelated databank sequences expected to yield the same or a higher score by pure chance

    BLAST results: High scoring pairs (HSPs)

    Fundamental unit of the BLAST algorithm output

    HSP (high scoring pair): aligned fragments of query and detected sequence with similarity score exceeding a set cutoff value

    Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
    Identities = 10/17 (58%), Positives = 16/17 (94%)

    Query:  81 SGDLSMLVLLPDEVSDL 97
               +GD+SM +LLPDE++D+
    Sbjct: 259 AGDVSMFLLLPDEIADV 275

    [Figure labels: E-value (Expectation), HSP]

    BLAST Confidence measures

    Score and bit-score: depend on the scoring method

    E-value (Expect value): number of unrelated database sequences expected to yield the same or a higher score by pure chance

    The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. (E-value approaching zero => significant alignment)

    An E value of 1 assigned to a hit can be interpreted as follows: in a database of the current size, one might expect to see 1 match with a similar score simply by chance.
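
The dependence of the E value on database size can be illustrated with a rough sketch. It assumes the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The slides do not state this formula, and the lengths below are hypothetical. BLAST itself uses effective lengths and, for several HSPs, sum statistics, so reported values will differ.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 100-residue query against 1,000,000 residues.
small_db = expect_value(30, 100, 1_000_000)
large_db = expect_value(30, 100, 2_000_000)
print(small_db, large_db)
```

Doubling the database size doubles the E value for the same bit score, which is why the interpretation above refers to a database of the current size, while each additional bit of score halves it.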

    Statistical Terminology

    True positive: A hit returned from a database search which is homologous with the query sequence. GOOD

    False positive: A hit returned from a database search which is not homologous with the query sequence. BAD

    True negative: A sequence which is not homologous with the query sequence is not returned from the database search. GOOD

    False negative: A sequence which is homologous with the query sequence is not returned from the database search. BAD

    Sensitivity: A program which is sensitive picks up most true positives.

    Selectivity: A program which is selective does not include false positives.
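
These terms can be put into numbers. In the sketch below, sensitivity is the fraction of true homologs that are returned, TP / (TP + FN). The slides give no formula for selectivity, so the fraction of returned hits that are true homologs, TP / (TP + FP), is used as an assumption. The counts are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of all true homologs that the search returned."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_pos, false_pos):
    """Fraction of returned hits that are true homologs (assumed definition)."""
    return true_pos / (true_pos + false_pos)


# Hypothetical search: 50 hits returned, 45 of them homologous,
# while 15 homologs in the database were missed.
print(sensitivity(true_pos=45, false_neg=15))  # 0.75
print(selectivity(true_pos=45, false_pos=5))   # 0.9
```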

    Conclusion

    Treat BLAST searches as scientific experiments

    Don't simply use the default parameters; the defaults change from time to time

    BLAST is quite complicated. But it's very useful.

    Thanks