
  • NCBI Molecular Biology Resources

    May 16, 2012

  • Searching the NCBI Databases

    BLAST: Sequence
    VAST: Structure
    Entrez: Text

  • http://www.ncbi.nlm.nih.gov/blast

  • Why do we need sequence similarity searching?

    •  To identify and annotate sequences with
       –  incomplete (or no) annotations (GenBank)
       –  incorrect annotations

    •  To assemble genomes

    •  To explore evolutionary relationships by
       –  finding homologous molecules
       –  developing phylogenetic trees

    NOTE: Similar sequences may NOT have similar function!
    NOTE: Homologous molecules may have similar functions.

    Searching with Sequences

  • Why Search Databases?

    • To find out if a new DNA sequence is already fully or partially present in the databanks.

    • To find homologous proteins to a putative coding ORF that might share similar 3D structure.

    • To identify homology (“relatedness”) between a query and entries in a database.

  • Molecular Evolution

    Common ancestry allows us to infer similar function.

    [Figure: species Yeast, Bacteria, Worm, Fly, Human; divergence times 3000 Myr, 1000 Myr, 540 Myr; genes MLH1 and MutL; human diseases shown: Alzheimer’s disease, ataxia telangiectasia, colon cancer, pancreatic carcinoma]

  • Some Terminology

  • Searching Sequence Databases

    • Two sequences are homologous when they share a common ancestry. This ancestry is often reflected in strong sequence similarity.

    • Computationally, threshold limits for sequence similarity can be defined by:
       –  length of the stretch of similar sequence
       –  percentage of identity between the sequences
       –  statistical measurements, like E-value, P-value, bit score, etc.

  • Similarity and Homology

    •  Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.

    • Homology is an evolutionary term used to describe relationship via descent from a common ancestor.

    • Homologous things are often similar, but not always (e.g. a whale flipper and a human arm).

    • Homology is NEVER expressed as a percentage

  • Orthologs vs Paralogs

    • Homologs can be separated into two classes: orthologs and paralogs.

    • Orthologs are homologous genes in different species that diverged through speciation; they often perform the same function.

    • Paralogs are homologous genes that arose through gene duplication, typically within a species, and may perform different functions.

  • Similarity and Homology

    •  Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.

    •  Non-homology CANNOT be inferred from non-similarity because non-similar things can still share a common ancestor.

    •  Homologous proteins share common structures, but not necessarily common sequence or function (e.g. FtsZ and tubulin).

    •  Remember: a pair of sequences either is or isn't homologous. There is no such thing as "64% homologous".

  • Searching sequence databases

    • When we search a sequence database, we are usually looking for related sequences.

    • Unfortunately, the algorithms that we have for searching databases do not search for homology; they search for similarity.

    • When similarity is found, we must determine whether it is a result of homology or comes from another source.

  • Substitutions, Insertions, Deletions

    •  Mutation: one of
       –  switch from one nucleotide to another
       –  insertion
       –  deletion

    •  Substitution: a switch in nucleotides which spreads throughout most of a species.

    •  Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other):

       original:      cggtatgcca
       descendant 1:  cgggtatccaa
       descendant 2:  ccctaggtccca

  • Example

    • For the previous example, cggtatgcca → cgggtatccaa, ccctaggtccca, the two descendant sequences align as follows:

       c g g g t a - - t - c c a a
       c c c - t a g g t c c c - a

    • “-” (indel) represents an insertion or deletion.
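As a minimal sketch (not part of the original slides), the columns of the alignment above can be tallied in Python. The function name `tally_columns` is an illustrative assumption.

```python
def tally_columns(row_a, row_b):
    """Count match, mismatch and indel columns in a gapped pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            counts["indel"] += 1      # a gap in either row marks an insertion/deletion
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# The two descendant sequences exactly as aligned on the slide.
row_1 = "cgggta--t-ccaa"
row_2 = "ccc-taggtccc-a"

counts = tally_columns(row_1, row_2)
percent_identity = 100.0 * counts["match"] / len(row_1)
print(counts, f"{percent_identity:.0f}% identical columns")
```

Here percent identity is taken over all alignment columns, gaps included; other conventions (for example, dividing by the shorter sequence length) give different numbers for the same alignment.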

  • Pairwise Sequence Alignments

    •  Purpose:
       –  identification of sequences with significant similarity to (a) sequence(s) in a sequence repository
       –  identification of all homologous sequences in the repository
       –  identification of domains with sequence similarity

    •  Terminology:
       –  Global alignment
       –  Local alignment

  • Terminology: Global Alignment

    • Finds the optimal alignment over the entire length of the two compared sequences

    • Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA

    • Suitable for sequences of homologous molecules

  • Terminology: Local Alignment

    • Finds short regions of similarity between a pair of sequences.

    • Compared sequences can receive high local similarity scores without needing high levels of similarity over their entire length.

    • Useful when looking for domains within proteins or for regions of genomic DNA that contain coding exons.

  • Algorithms

    •  Needleman-Wunsch
       –  Exhaustive global alignment
       –  Most rigorous method when aligning conserved sequences of similar length (no exon shuffling, insertion/deletion etc.)

    •  Smith-Waterman
       –  Exhaustive local alignment
       –  Alignment does not have to extend along the full length of the sequences
       –  In contrast to N-W, alignments initiating at all possible positions of the sequence space are considered
       –  Can be very slow
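The difference between the two algorithms can be sketched with one dynamic-programming routine. This is a simplified illustration, not the implementation used by any particular tool: the scoring values (match +1, mismatch -1, linear gap -2) and the function name `alignment_score` are assumptions chosen for readability.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming.

    local=False gives a Needleman-Wunsch style global score,
    local=True gives a Smith-Waterman style local score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both sequences end to end. Local: best cell anywhere.
    return best if local else score[-1][-1]


print("global:", alignment_score("cgggtatccaa", "ccctaggtccca", local=False))
print("local: ", alignment_score("cgggtatccaa", "ccctaggtccca", local=True))
```

Both variants fill the full table, which is why exhaustive alignment becomes slow for database-sized searches: the work grows with the product of the two sequence lengths.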

  • Basic Local Alignment Search Tool

    •  Calculates similarity for biological sequences
    •  Finds best local alignments
    •  A heuristic approach based on the Smith-Waterman algorithm
       –  Searches for matching “words” rather than individual residues
       –  Uses statistical theory to determine if a match might have occurred by chance

  • Local vs. Global Alignment

    BLASTp: protein-protein comparison (a local alignment protocol)

    Human:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
               +A + +    + DL F K D+L I+  T+   W+      GR G IP+NYV + + +++       PW+
    Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

    Human:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
               GK+ R  AE+ L     E G FLVR+S +   D +L V  +  V+HYRI + H     I     F  L
    Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

    Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
               L+ HY  +ADGLC  L  P             Y   W ++ + ++L++ IG G+FG+V  G +  N  VA
    Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

    Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
               VK +K   A    FLAEA +M +LRH  L+ L  V   ++  + IVTE M + +L+ +L+ RGR
    Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

    Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
                L++ S  V   M YLE  NF+HRDLAARN+L++     K++DFGL     KE      TG + P+KWTA
    Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

    Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
               PEA    +F+TKSDVWSFGILL EI +FGR+PYP +   +V+ +V+ GY+M  P GCP  +Y++M+ CW
    Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

    Human: 424 LDAAMRPSFLQLREQLEHI 443
                D   RP+F  L+ +LE +
    Worm:  472 SDPDKRPTFETLQWKLEDL 492

    Align program (Lipman and Pearson): a global alignment protocol (start and end of the alignment shown)

    human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
          M              S ..                      AA  SG.            . .A ... .
    worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
          1                 20                  40                  60

           440               450
    human REQLEHI--------KTHELHL
          . .:: .        :   ...
    worm  QWKLEDLFNLDSSEYKEASINF
                      500

  • Nucleotide Words

    Query: GTACTGGACATGGACCCTACAGGAA

    Word Size = 11

    GTACTGGACAT
     TACTGGACATG
      ACTGGACATGG
       CTGGACATGGA
        TGGACATGGAC
         GGACATGGACC
          GACATGGACCC
           ACATGGACCCT
            ...........

    Make a lookup table of words

    Minimum word size = 7
    blastn default = 11
    megablast default = 28
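The word list and lookup table on this slide can be sketched in a few lines of Python. This is an illustration of the idea only; the names `words` and `build_lookup` are hypothetical and do not correspond to BLAST's internal data structures.

```python
def words(sequence, word_size):
    """All overlapping words (k-mers) of the given size, in order."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


def build_lookup(sequence, word_size):
    """Map each word to the list of query positions (0-based) where it starts."""
    table = {}
    for position, word in enumerate(words(sequence, word_size)):
        table.setdefault(word, []).append(position)
    return table


query = "GTACTGGACATGGACCCTACAGGAA"
query_words = words(query, 11)
lookup = build_lookup(query, 11)

for word in query_words[:8]:
    print(word, lookup[word])
```

A database scan then only has to ask, for each word it reads, whether that word is a key in the table; every hit is a candidate seed for extension.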

  • An alignment that BLAST can’t find

      1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
        || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
      1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

     61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
        | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
     61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

    121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
        |||| || ||||| ||  ||    |  | ||||  || |||
    121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

    Here there are no words longer than 6…

    ...for nucleotides there must be an exact match of at least 7.
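The claim that no word longer than 6 is shared can be checked with a short Python sketch. It measures the longest run of identical aligned columns in the first 60 columns of the alignment above; the function name `longest_identical_run` is an illustrative assumption.

```python
def longest_identical_run(row_a, row_b):
    """Length of the longest stretch of consecutive identical, ungapped columns."""
    longest = current = 0
    for a, b in zip(row_a, row_b):
        current = current + 1 if (a == b and a != "-") else 0
        longest = max(longest, current)
    return longest


# First 60 columns of the alignment shown on the slide.
top = "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
bottom = "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"

run = longest_identical_run(top, bottom)
print("longest exact run:", run)
print("long enough to seed a hit at word size 7:", run >= 7)
```

Because the run never reaches the minimum word size, a word-based search has no seed from which to start extending along this diagonal, even though the sequences are visibly related.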

  • Protein Words

    Query: GTQITVEDLFYNIATRRKALKN

    Word Size = 3

    GTQ TQI QIT ITV TVE VED EDL DLF ...

    Neighborhood Words (e.g. for ITV): LTV, MTV, ISV, LSV, etc.

    Make a lookup table of words

    Word Size can be 2 or 3 (default = 3)
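To make the idea of neighborhood words concrete, here is a hedged Python sketch that scores every possible 3-letter word against the query word ITV and keeps those at or above a threshold. The three score rows are copied from the BLOSUM62 matrix shown later in this deck; the threshold of 11 and the function name `neighborhood` are assumptions made for the example.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for I, T and V only, in AMINO_ACIDS column order.
BLOSUM62_ROWS = {
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
}


def neighborhood(word, threshold):
    """Words whose summed substitution score against `word` is >= threshold.

    Only residues present in BLOSUM62_ROWS (I, T, V) may appear in `word`.
    """
    rows = [BLOSUM62_ROWS[residue] for residue in word]
    hits = {}
    for candidate in product(range(len(AMINO_ACIDS)), repeat=len(word)):
        total = sum(row[index] for row, index in zip(rows, candidate))
        if total >= threshold:
            hits["".join(AMINO_ACIDS[i] for i in candidate)] = total
    return hits


neighbors = neighborhood("ITV", threshold=11)
for word, score in sorted(neighbors.items(), key=lambda item: -item[1]):
    print(word, score)
```

With these rows LTV scores 11, MTV 10, ISV 9 and LSV 7, so how many of the slide's example words count as neighbors depends on the threshold chosen: a lower threshold admits more words and raises sensitivity at the cost of speed.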

  • Minimum Requirements for a Hit

    •  Nucleotide BLAST requires one exact match
    •  Protein BLAST requires two neighboring matches within 40 residues

    Nucleotide: exact word match (one match)

       ATCGCCATGCTTAATTGGGCTT
            CATGCTTAATT

    Protein: neighborhood words (two matches)

       GTQITVEDLFYNI
       SEI YYN

  • Some Flavors of BLAST

    Program    Query                      Database
    blastn     Nucleotide                 Nucleotide
    blastp     Protein                    Protein
    blastx     Nucleotide (translated)    Protein
    tblastn    Protein                    Nucleotide (translated)
    tblastx    Nucleotide (translated)    Nucleotide (translated)

  • BLAST Selection Matrix

  • Nucleotide vs. Protein BLAST

    H.sapiens:  aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
                 N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
                 +  +  V  +     V  L  G     Q  W  G  D  E  G
                 S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
    A.thaliana: agtcaagtatctggtgtactcggttgccaatggggagatgaaggt

    Comparing ADSS from H. sapiens and A. thaliana

    BLASTp finds three matching words.
    BLASTn finds no match, because there are no 7 bp words.

    Protein searches are generally more sensitive than nucleotide searches.
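A short Python sketch shows why the protein comparison retains more signal: synonymous codon changes disappear on translation. The codon table below is deliberately partial and covers only the codons in these two fragments; the helper names are illustrative assumptions.

```python
# Partial standard genetic code: only the codons that occur in the two fragments.
CODON_TABLE = {
    "aac": "N", "cgg": "R", "gtg": "V", "acg": "T", "ctc": "L", "ggt": "G",
    "gcg": "A", "cag": "Q", "tgg": "W", "ggc": "G", "gac": "D", "gaa": "E",
    "agt": "S", "caa": "Q", "gta": "V", "tct": "S", "tgc": "C", "gga": "G",
    "gat": "D",
}


def translate(dna):
    """Translate an in-frame DNA fragment codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identical_positions(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


human_dna = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
plant_dna = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

human_protein = translate(human_dna)
plant_protein = translate(plant_dna)

print(human_protein)
print(plant_protein)
print("identical nucleotides:", identical_positions(human_dna, plant_dna), "of", len(human_dna))
print("identical residues:   ", identical_positions(human_protein, plant_protein), "of", len(human_protein))
```

The nucleotide identities are scattered by third-position differences, so no run reaches the 7 bp minimum word size, while the protein identities fall in contiguous stretches that word matching can pick up.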

  • Choosing the Right BLAST Flavor for Proteins

    What you want to do: Find out something about the function of the protein
    The right BLAST flavor: Use blastp to compare your protein with other proteins contained in the databases.

    What you want to do: Discover new genes encoding similar proteins
    The right BLAST flavor: Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading frames.

    Claverie & Notredame 2003

  • Choosing the Right BLAST Flavor for DNA

    Question: Am I interested in non-coding DNA?
    Answer: Yes, use blastn. Note: blastn is only for closely related DNA sequences (more than 70% identical).

    Question: Do I want to discover new proteins?
    Answer: Yes, use tblastx.

    Question: Do I want to discover proteins encoded in my query DNA sequences?
    Answer: Yes, use blastx.

    Question: Am I unsure of the quality of my DNA?
    Answer: Yes, use blastx, especially if you suspect your DNA sequence codes for a protein but may contain sequencing errors.

    Claverie & Notredame 2003

  • Choosing the Right BLAST Flavor for DNA Sequences

    Usage                             Query             Database          Program
    Find very similar DNA sequence    DNA               DNA               blastn
    Protein discovery and ESTs        Translated DNA    Translated DNA    tblastx
    Analysis of query DNA sequence    Translated DNA    Protein           blastx

    Claverie & Notredame 2003

  • Some WWW-BLAST Databases

    Nucleotide

    •  nr (nt)
       –  Traditional GenBank (gb) divisions
       –  NM_ and XM_ RefSeqs
    •  est
    •  htgs
    •  gss
    •  wgs
    •  chromosome
       –  NC_ RefSeqs
    •  env_nt
    •  month

    Protein

    •  nr (non-redundant sequences)
       –  GenBank CDS translations
       –  NP_ RefSeqs
       –  PIR, Swiss-Prot, PRF
       –  PDB (sequences from structures)
    •  swissprot
    •  pat - patents
    •  pdb - sequences with 3D structures
    •  env_nr - environmental samples
    •  month - sequences updated within the past 30 days

  • Scoring Systems - Nucleotides

    Identity matrix

          A    G    C    T
    A    +1   –3   –3   –3
    G    –3   +1   –3   –3
    C    –3   –3   +1   –3
    T    –3   –3   –3   +1

    CAGGTAGCAAGCTTGCATGTCA
    || ||||||||||||  |||||     raw score = 19 - 9 = 10
    CACGTAGCAAGCTTG-GTGTCA
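The raw score on this slide can be reproduced with a small Python sketch. Match (+1) and mismatch (-3) come from the identity matrix; the gap cost of -3 is an assumption inferred from the slide's arithmetic (19 matches minus a total penalty of 9), and the function name `raw_score` is hypothetical.

```python
def raw_score(row_a, row_b, match=1, mismatch=-3, gap=-3):
    """Sum column scores of a gapped nucleotide alignment."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap          # assumed gap cost, see the note above
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


query = "CAGGTAGCAAGCTTGCATGTCA"
subject = "CACGTAGCAAGCTTG-GTGTCA"

score = raw_score(query, subject)
print("raw score:", score)
```

The alignment has 19 matching columns, 2 mismatches and 1 gap column, so the total is 19 - 6 - 3.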

  • Scoring Systems - Proteins Position Independent Matrices

    PAM Matrices (Percent Accepted Mutation)
    •  Derived from observation; small dataset of alignments
    •  Implicit model of evolution
    •  All calculated from PAM1
    •  PAM250 widely used

    BLOSUM Matrices (BLOck SUbstitution Matrices)
    •  Derived from observation; large dataset of highly conserved blocks
    •  Each matrix derived separately from blocks with a defined percent identity cutoff
    •  BLOSUM62 - default matrix for BLAST

    Position Specific Score Matrices (PSSMs)
    •  PSI- and RPS-BLAST

  • PAM250-Matrix

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A  2
    R -2  6
    N  0  0  2
    D  0 -1  2  4
    C -2 -4 -4 -5 12
    Q  0  1  1  2 -5  4
    E  0 -1  1  3 -5  2  4
    G  1 -3  0  1 -3 -1  0  5
    H -1  2  2  1 -3  3  1 -2  6
    I -1 -2 -2 -2 -2 -2 -2 -3 -2  5
    L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
    K -1  3  1  0 -5  1  0 -2  0 -2 -3  5
    M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
    F -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
    P  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
    S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2
    T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3
    W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
    Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
    V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4

    Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. in Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.

  • BLOSUM62 Substitution Matrix

    Common amino acids have low weights

    A  4
    R -1  5
    N -2  0  6
    D -2 -2  1  6
    C  0 -3 -3 -3  9
    Q -1  1  0  0 -3  5
    E -1  0  0  2 -4  2  5
    G  0 -2  0 -1 -3 -2 -2  6
    H -2  0  1 -1 -3  0  0 -2  8
    I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
    L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
    K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
    M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
    F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
    S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
    T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
    Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
    V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
    X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

    Rare amino acids have high weights

    Negative for less likely substitutions

    Positive for more likely substitutions

  • Matrix differences

    BLOSUM
    •  Built from local alignments
    •  Built from a vast amount of data
    •  Based on groups of related sequences counted as one
    •  Better for finding local alignments
    •  Lower BLOSUM series means more divergence

    PAM
    •  Built from global alignments
    •  Built from a small amount of data
    •  Based on minimum replacement or maximum parsimony
    •  Better for finding global alignments and remote homologs
    •  Higher PAM series means more divergence

  • Matrices - Rules of thumb

    Need different levels of sensitivity?
    –  Close relationships: low PAM number (e.g. PAM 1) or high BLOSUM number (e.g. BLOSUM 80)
    –  Distant relationships: high PAM number (e.g. PAM 250) or low BLOSUM number (e.g. BLOSUM 45)

  • Local Alignment Statistics

    High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments).

    [Figure: distribution curve; x-axis: Score, y-axis: Alignments]

    $E = K m n e^{-\lambda S}$

    $E = m n 2^{-S'}$

    K = scale for search space
    λ = scale for scoring system
    S' = bit score = $(\lambda S - \ln K) / \ln 2$

    Expect Value: E = number of database hits you expect to find by chance

    Annotations on the formula: size of database ($m n$), your score ($S'$), expected number of random hits ($E$)

    http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
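The two forms of the E-value formula above are equivalent, which a brief Python sketch can confirm numerically. The values of λ and K, the search-space sizes m and n, and the raw score are illustrative assumptions; real values of λ and K depend on the scoring matrix and gap costs.

```python
import math


def bit_score(raw, lam, k):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(raw, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.318, 0.134        # assumed, for illustration only
m, n = 300, 5_000_000           # hypothetical query length and database length
raw = 92                        # hypothetical raw score

bits = bit_score(raw, LAMBDA, K)
e_raw = expect_from_raw(raw, m, n, LAMBDA, K)
e_bits = expect_from_bits(bits, m, n)
print(f"bit score = {bits:.1f}, E (raw form) = {e_raw:.3g}, E (bit form) = {e_bits:.3g}")
```

The bit score folds λ and K into the score itself, which is why bit scores from searches with different scoring systems can be compared directly while raw scores cannot.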

  • Protein BLAST page

    >sorting nexin 18 [Homo sapiens]
    MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASVQVIRAPEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPAS

    Choose your database

    Accession, GI, or sequence

  • Advanced BLAST Options

    Matrix Selection (protein)
    •  PAM30 -- most stringent
    •  BLOSUM45 -- least stringent

    Limit by taxon
    •  Mus musculus[Organism]
    •  Mammalia[Organism]
    •  Viridiplantae[Organism]

    Example Entrez Queries
    •  green plants[Organism]
    •  srcdb_refseq[Properties]
    •  biomol_mrna[Properties]
    •  proteins_all[Filter] NOT mammalia[Organism]

    Expect Value Cut-off

    Other Advanced
    •  -e 10000 (expect value)
    •  -v 2000 (descriptions)
    •  -b 2000 (alignments)

  • Low Complexity Filtering

    sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
    Length = 414

    Score = 40.2 bits (92), Expect = 0.013
    Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)

    Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
               S++S  SSS+S SS +  +     ++S      +  +  S   S   S+ +    E K
    Sbjct:  29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88

    Filtered | Unfiltered

  • Homology Search: Sorting Nexin 18

    >Notch homolog 3 [Homo sapiens]
    MGPGARGRRRRRRPMSPPPPPPPVRALPLLLLLAGPGAAAPPCLDGSPCANGGRCTQLPSREAACLCPPGWVGERCQLEDPCHSGPCAGRGVCQSSVVAGTARFSCRCPRGFRGPDCSLPDPCLSSPCAHGARCSVGPDGRFLCSCPPGYQGRSCRSDVDEC

    >sorting nexin 18 [Homo sapiens]
    MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASVQVIRAPEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPAS

  • BLAST Formatting Page

    Results of a Conserved Domain Search

  • Low Complexity Filter

    >gi|20140146|sp|Q96RF0|SNXI_HUMAN Sorting nexin 18
    Length = 628

    Score = 1048 bits (2710), Expect = 0.0
    Identities = 528/628 (84%), Positives = 528/628 (84%)

    Query:   1 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60
               MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR
    Sbjct:   1 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60

    Query:  61 XXXXXXXXXXXXXXXXXXXNVPPGGFEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSTFQ 120
                                  NVPPGGFE                             STFQ
    Sbjct:  61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120
    . . .

    The same region with the low complexity sequence shown in lowercase:

    Query:  61 apepgpagdggpgaparyaNVPPGGFEplpvappasfkpppdafqallqpqqapppSTFQ 120
               APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ
    Sbjct:  61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120
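BLAST's own filters (SEG for proteins, DUST for nucleotides) are more elaborate than what follows. As a rough illustration of the idea only, the sketch below swaps in a plain sliding-window Shannon entropy: windows dominated by one or two residue types have low entropy and are masked with X. The window length, the cutoff and the function names are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    total = len(window)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(window).values())


def mask_low_complexity(sequence, window=12, cutoff=2.2):
    """Replace every residue covered by a low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < cutoff:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


serine_rich = "SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSS"
print(mask_low_complexity(serine_rich))
print(mask_low_complexity("MALRARALYDFR"))
```

Masking such regions before the search keeps compositionally biased stretches from producing hits, like the serine-rich match shown earlier, that reflect composition rather than common ancestry.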

  • BLAST Output Overview

    • Graphic Display: Shows you where your query is similar to other sequences.
    • Hit List: Names of sequences similar to your query, ranked by similarity.
    • Alignments: Every alignment between your query and the reported hits.
    • Parameters: List of the various parameters used for the search.

  • BLAST Output: Graphic Overview

    Results of the CD Search: SH3, PX

    Red bar = most similar sequence
    Pink = almost as similar
    Green = even less similar
    Blue/Black = worse scores

    Mouse over, click for active links

  • BLAST Output: Descriptions

    Links to Sequence Records (link to Entrez)

    Expect Value Cut-off: default = 10 (example E-value shown: 4 × 10^-68)

    Bit scores < 50 unreliable

    Link icons: L = LocusLink, U = UniGene, G = GEO, S = Structure

  • A Little on Interpretation

    • How similar must sequences be in order to be considered homologous?

    • As a rule of thumb: more than 25% of the amino acids are identical for proteins, or more than 70% of the nucleotides are identical for DNA. Above these limits, you can be reasonably confident that the two sequences share a common ancestor and, for proteins, a similar structure.

    • Note: this applies only to alignments longer than 100 aa or nt.

  • A Little on Interpretation: E-value

    •  The E-value (Expectation value) determines how much you can trust your conclusion on homology.

    •  It allows comparison of pairwise alignments with different similarities and different lengths, an advantage over percent identity (not discussed).

    •  Definition: the number of times a match this good is expected to occur in the database by chance. A match that is unlikely to occur by chance is a good match, so the lowest E-values (as close to 0 as possible) are the best and most significant; they can be trusted enough to infer homology.

    •  If you want to be certain of homology, your E-values must be below 10^-4 (0.0001).

  • Results From “nr” (protein)

    non-redundant set

  • BLAST Output: Pairwise Alignments

    >gi|12643956|sp|Q9Y5X1|SNX9_HUMAN Sorting nexin 9 (SH3 and PX domain-containing protein 1) (SDP1 protein)
    Length = 595

    Score = 255 bits (652), Expect = 4e-68
    Identities = 140/322 (43%), Positives = 185/322 (56%), Gaps = 7/322 (2%)

    Query: 221 SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI 280
               SS+++   LN+F  F K G E ++L  A    K  +K+ +++G YGP W      F C +
    Sbjct: 197 SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV 254

    Query: 281 DDPTKQTKFKGMKSYISYKLVPTHTQVPVHRRYKHFDWLYARLAEKF-PVISVPHLPEKQ 339
                DP K +K  G+KSYI Y+L PT+T   V+ RYKHFDWLY RL  KF   I +P LP+KQ
    Sbjct: 255 ADPRKGSKMYGLKSYIEYQLTPTNTNRSVNHRYKHFDWLYERLLVKFGSAIPIPSLPDKQ 314

    Query: 340 ATGRFEEDFISKRRKGLIWWMNHMASHPVLAQCDVFQHFLTCPSSTDEKAWKQGKRKAEK 399
                TGRFEE+FI  R + L  WM  M  HPV+++ +VFQ FL   +  DEK WK GKRKAE+
    Sbjct: 315 VTGRFEEEFIKMRMERLQAWMTRMCRHPVISESEVFQQFL---NFRDEKEWKTGKRKAER 371

    VH V+ VN
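The score lines of a hit like the one above can be pulled apart with regular expressions and checked against the rules of thumb from the interpretation slides (E-value below 10^-4, more than 25% identity for proteins, more than 100 residues aligned). This is a sketch for the text layout shown in this deck only, not a general BLAST report parser; the function names are hypothetical.

```python
import re

SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*Expect\s*=\s*([\d.eE+-]+)")
IDENTITY_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def parse_hit(text):
    """Extract bit score, raw score, E-value and identity from a hit's score lines."""
    score = SCORE_RE.search(text)
    identity = IDENTITY_RE.search(text)
    if score is None or identity is None:
        raise ValueError("score lines not found")
    identical, length = int(identity.group(1)), int(identity.group(2))
    return {
        "bits": float(score.group(1)),
        "raw": int(score.group(2)),
        "expect": float(score.group(3)),
        "identity_pct": 100.0 * identical / length,
        "length": length,
    }


def passes_rules_of_thumb(hit, max_expect=1e-4, min_identity_pct=25.0, min_length=100):
    """Apply the protein thresholds quoted on the interpretation slides."""
    return (hit["expect"] < max_expect
            and hit["identity_pct"] > min_identity_pct
            and hit["length"] > min_length)


report = """Score = 255 bits (652), Expect = 4e-68
Identities = 140/322 (43%), Positives = 185/322 (56%), Gaps = 7/322 (2%)"""

hit = parse_hit(report)
print(hit, passes_rules_of_thumb(hit))
```

For real work a structured output format is easier to parse than the pairwise text report; the regular expressions here are tied to the wording of the lines shown on the slide.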

  • BLAST Output: Alignments

    >gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1)
    Length = 756

    Score = 233 bits (593), Expect = 8e-62
    Identities = 117/131 (89%), Positives = 117/131 (89%)

    Query:   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
               IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
    Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335

    Query:  61 GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
               GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
    Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395

    Query: 121 FLQPLSKPLSS 131
               FLQPLSKPLSS
    Sbjct: 396 FLQPLSKPLSS 406

    Low complexity sequence filtered (shown as X in the query)

  • Genomic BLAST

    •  The BLAST homepage links to the Genome BLAST pages, which provide customized nucleotide and protein databases for each genome.

    •  If a Map Viewer is available, the BLAST hits can be viewed on the maps.

  • Identify your sequence with BLAST

    BG743989

    Human EST from a Natural Killer Cell Culture:

    DEFINITION: 602722761F1 NIH_MGC_106 Homo sapiens cDNA clone IMAGE:4849239 5', mRNA sequence.

  • BLAST Tips

    • It is faster and more accurate to BLAST proteins (blastp) rather than nucleotides.
    • If in doubt, use blastp.
    • When possible, restrict the search to the subset of the database you are interested in.
    • Look around for the database you need or create your own custom BLAST database. BUT HOW???
    • When is the best time to use the BLAST server?
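One possible answer to the "BUT HOW???" above, assuming the NCBI BLAST+ command-line tools are installed locally: `makeblastdb` builds a database from a FASTA file and `blastp` searches it. The Python sketch below only assembles the two command lines; the file names, database name and E-value cut-off are hypothetical placeholders.

```python
def makeblastdb_command(fasta_path, db_name, dbtype="prot"):
    """Command line that builds a BLAST database from a FASTA file."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastp_command(query_path, db_name, evalue=1e-4):
    """Command line that searches a protein query against a custom database.

    -outfmt 6 requests tab-separated output, one line per hit.
    """
    return ["blastp", "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6"]


# Hypothetical file names, for illustration only.
commands = [
    makeblastdb_command("my_proteins.fasta", "my_db"),
    blastp_command("query.fasta", "my_db"),
]
for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the input files exist; building the commands as lists rather than strings avoids shell quoting problems with unusual file names.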