ncbi molecular biology resourceshpc.ilri.cgiar.org/beca/training/imbb/lectures/ncbi_2013... ·...

NCBI Molecular Biology Resources

May 16, 2012

•  BLAST: sequence
•  VAST: structure
•  Entrez: text

Searching the NCBI Databases

http://www.ncbi.nlm.nih.gov/blast

Why do we need  sequence similarity searching?

•  To identify and annotate sequences with…
   −  incomplete (or no) annotations (GenBank)
   −  incorrect annotations
•  To assemble genomes
•  To explore evolutionary relationships by…
   −  finding homologous molecules
   −  developing phylogenetic trees

NOTE: Similar sequences may NOT have similar function!
NOTE: Homologous molecules may have similar functions.

Searching with Sequences

Why Search Databases?

• To find out if a new DNA sequence is already fully or partially present in the databanks.

• To find homologous proteins to a putative coding ORF that might share similar 3D structure.

• To identify homology (“relatedness”) between a query and entries in a database.

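Molecular Evolution

Common ancestry allows us to infer similar function

[Figure: evolutionary tree spanning Bacteria, Yeast, Worm, Fly and Human, with divergence times marked at 3000 Myr, 1000 Myr and 540 Myr. Disease labels: Alzheimer’s Disease, Ataxia telangiectasia, Colon cancer, Pancreatic carcinoma. Gene labels: MLH1, MutL.]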

Some Terminology

Searching Sequence Databases

• Two sequences are homologous when they share a common ancestry. This ancestry is usually reflected in strong sequence similarity.

• Computationally, threshold limits for sequence similarity can be defined by:
   –  length of the stretch of similar sequence
   –  percentage of identity between the sequences
   –  statistical measurements, like E-value, P-value, Bit-score, etc.

Similarity and Homology

•  Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.

• Homology is an evolutionary term used to describe relationship via descent from a common ancestor.

• Homologous things are often similar, but not always (e.g. a whale flipper and a human arm).

• Homology is NEVER expressed as a percentage

Orthologs vs Paralogs

• Homologs can be separated into two classes: orthologs and paralogs.

• Orthologs are homologous genes in different species that diverged through speciation; they often perform the same function.

• Paralogs are homologous genes that arose by gene duplication, typically within a species, and may perform different functions.

Similarity and Homology

•  Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.

•  Non-homology CANNOT be inferred from non-similarity because non-similar things can still share a common ancestor.

•  Homologous proteins share common structures, but not necessarily common sequence or function (e.g. FtsZ and tubulin).

•  Remember: a pair of sequences either is or isn't homologous. There is no such thing as “64% homologous”.

Searching sequence databases

• When we search a sequence database, we are usually looking for related sequences.

• Unfortunately, the algorithms that we have for searching databases do not search for homology; they search for similarity.

• When similarity is found, we must determine whether it is a result of homology or comes from another source.

Substitutions, Insertions, Deletions

•  Mutation: one of
   –  switch from one nucleotide to another
   –  insertion
   –  deletion

•  Substitution: a switch in nucleotides which spreads throughout most of a species.

•  Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other):

   Original:     cggtatgcca
   Descendant 1: cgggtatccaa
   Descendant 2: ccctaggtccca

Example

• For the previous example cggtatgcca → cgggtatccaa, ccctaggtccca, the two descendent sequences align as follows:

   c g g g t a - - t - c c a a
   c c c - t a g g t c c c - a

• “-” (indel) represents an insertion or deletion.
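The alignment above can be summarized column by column. The following is a minimal sketch, not part of the original slides; the function name `alignment_stats` is a hypothetical helper introduced only for illustration.

```python
def alignment_stats(row_a: str, row_b: str) -> dict:
    """Count identical, mismatched and indel columns of a gapped pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    stats = {"identities": 0, "mismatches": 0, "indels": 0}
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            stats["indels"] += 1      # a gap in either row marks an insertion/deletion
        elif a == b:
            stats["identities"] += 1
        else:
            stats["mismatches"] += 1
    return stats


# The two descendent sequences from the slide, written without the spaces.
row_a = "cgggta--t-ccaa"
row_b = "ccc-taggtccc-a"
stats = alignment_stats(row_a, row_b)
print(stats)
```

Removing the "-" characters from each row gives back the two unaligned descendent sequences, which is a quick way to check that an alignment was transcribed correctly.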

Pairwise Sequence Alignments

•  Purpose:
   •  identification of sequences with significant similarity to (a) sequence(s) in a sequence repository
   •  identification of all homologous sequences in the repository
   •  identification of domains with sequence similarity

• Terminology
   •  Global alignment
   •  Local alignment

Terminology: Global Alignment

• Finds the optimal alignment over the entire length of the two compared sequences

• Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA

• Suitable for sequences of homologous molecules

Terminology: Local Alignment

• Finds short regions of similarity between a pair of sequences.

• Compared sequences can receive high local similarity scores without needing high levels of similarity over their entire length.

• Useful when looking for domains within proteins or for regions of genomic DNA that contain coding exons.

Algorithms

• Needleman-Wunsch
   –  Exhaustive global alignment
   –  Most rigorous method when aligning conserved sequences of similar length (no exon shuffling, insertion/deletion etc.)

•  Smith-Waterman
   –  Exhaustive local alignment
   –  Alignment does not have to extend along the full length of the sequences
   –  In contrast to N-W, alignments initiating at all possible positions of the sequence space will be considered
   –  Can be very slow
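The difference between the two algorithms can be seen in a few lines of dynamic programming. This is a minimal sketch under stated assumptions: it returns only the optimal score (no traceback), uses a simple linear gap penalty, and the default scores (+1 match, -3 mismatch, -3 gap) are illustrative choices rather than values prescribed by either algorithm.

```python
def align_score(a: str, b: str, match: int = 1, mismatch: int = -3,
                gap: int = -3, local: bool = False) -> int:
    """Optimal alignment score by dynamic programming.

    local=False follows the Needleman-Wunsch (global) recurrence;
    local=True follows the Smith-Waterman (local) recurrence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two sequences that share only a central region.
a = "TTTTACGTACGTTTTT"
b = "GGGGACGTACGGGGG"
print("global:", align_score(a, b))
print("local: ", align_score(a, b, local=True))
```

Both variants fill an (m+1) × (n+1) table, which is why exhaustive alignment against a whole database is slow and why BLAST uses a heuristic instead.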

Basic Local Alignment Search Tool

•  Calculates similarity for biological sequences
•  Finds best local alignments
•  A heuristic approach based on the Smith-Waterman algorithm
   −  Searches for matching “words” rather than individual residues
   −  Uses statistical theory to determine if a match might have occurred by chance

Local vs. Global Alignment

BLASTp: protein-protein comparison -a local alignment protocol-

H: 15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
+A + + + DL F K D+L I+ T+ W+ GR G IP+NYV + + +++ PW+
Worm: 63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H: 85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L
Worm: 126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
L+ HY +ADGLC L P Y W ++ + ++L++ IG G+FG+V G + N VA
Worm: 195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR
Worm: 265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA
Worm: 333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW
Worm: 402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
D RP+F L+ +LE +
Worm: 472 SDPDKRPTFETLQWKLEDL 492

Align program (Lipman and Pearson) -a global alignment protocol-

human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
M S .. AA SG. . .A ... .
worm MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
1 20 40 60

440 450
human REQLEHI--------KTHELHL
. .:: . : ...
worm QWKLEDLFNLDSSEYKEASINF
500

Nucleotide Words

Query: GTACTGGACATGGACCCTACAGGAA

Word Size = 11

GTACTGGACAT
 TACTGGACATG
  ACTGGACATGG
   CTGGACATGGA
    TGGACATGGAC
     GGACATGGACC
      GACATGGACCC
       ACATGGACCCT
        ...........

Make a lookup table of words

Minimum word size = 7
blastn default = 11
megablast default = 28
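The word list and lookup table can be sketched as follows. This is an illustration of the idea only, not BLAST's actual implementation; `words`, `lookup_table` and `exact_seeds` are hypothetical helper names, and the subject sequence is made up for the example.

```python
from collections import defaultdict


def words(seq: str, w: int) -> list[str]:
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def lookup_table(query: str, w: int) -> dict[str, list[int]]:
    """Map each query word to the query positions (0-based) where it occurs."""
    table = defaultdict(list)
    for pos, word in enumerate(words(query, w)):
        table[word].append(pos)
    return dict(table)


def exact_seeds(query: str, subject: str, w: int) -> list[tuple[int, int]]:
    """(query position, subject position) pairs where a word matches exactly."""
    table = lookup_table(query, w)
    return [(qpos, spos)
            for spos, word in enumerate(words(subject, w))
            for qpos in table.get(word, [])]


query = "GTACTGGACATGGACCCTACAGGAA"       # query from the slide
subject = "CCGGACATGGACCCGG"              # hypothetical database sequence
print(words(query, 11)[:8])
print(exact_seeds(query, subject, 11))
```

Scanning the subject once against a prebuilt table is what lets the search skip most of the database: only positions with a seed are examined further.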

An alignment that BLAST can’t find

1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
|| | || || || | || || || || | ||| |||||| | | || | ||| |
1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
| || || || ||| || | |||||| || | |||||| ||||| | |
61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
|||| || ||||| || || | | |||| || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Here there are no words longer than 6…

...for nucleotides there must be an exact match of at least 7.

Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Word Size = 3

GTQ
 TQI
  QIT
   ITV
    TVE
     VED
      EDL
       DLF
        ...

Neighborhood Words: LTV, MTV, ISV, LSV, etc.

Make a lookup table of words

Word Size can be 2 or 3 (default = 3)
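Neighborhood words can be generated by scoring every possible word against a query word and keeping those at or above a threshold. The sketch below uses the BLOSUM62 values from the matrix shown later in these slides; the threshold of 11 is an assumption chosen for illustration, and `neighborhood` is a hypothetical helper, not a BLAST API.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"

# Lower triangle of BLOSUM62, rows and columns in the order of AMINO.
LOWER_TRIANGLE = """
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""


def build_matrix() -> dict:
    """Expand the lower triangle into a symmetric (residue, residue) -> score map."""
    scores = {}
    rows = [line.split() for line in LOWER_TRIANGLE.strip().splitlines()]
    for i, row in enumerate(rows):
        assert len(row) == i + 1, "row i of a lower triangle has i + 1 entries"
        for j, value in enumerate(row):
            scores[AMINO[i], AMINO[j]] = scores[AMINO[j], AMINO[i]] = int(value)
    return scores


BLOSUM62 = build_matrix()


def word_score(w1: str, w2: str) -> int:
    """Sum of substitution scores over aligned positions of two equal-length words."""
    return sum(BLOSUM62[a, b] for a, b in zip(w1, w2))


def neighborhood(word: str, threshold: int) -> dict[str, int]:
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = {}
    for candidate in product(AMINO, repeat=len(word)):
        cand = "".join(candidate)
        s = word_score(word, cand)
        if s >= threshold:
            hits[cand] = s
    return hits


print(neighborhood("ITV", threshold=11))   # threshold 11 is an illustrative assumption
for w in ("LTV", "MTV", "ISV", "LSV"):
    print(w, word_score("ITV", w))
```

With these matrix values, the slide's example words score 11 (LTV), 10 (MTV), 9 (ISV) and 7 (LSV) against ITV, so how many of them count as neighbors depends on the threshold chosen.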

Minimum Requirements for a Hit

•  Nucleotide BLAST requires one exact match
•  Protein BLAST requires two neighboring matches within 40 residues

Nucleotide example (exact word match, one match):
   query ATCGCCATGCTTAATTGGGCTT
   word       CATGCTTAATT

Protein example (neighborhood words, two matches):
   query GTQITVEDLFYNI
   words SEI, YYN

Some Flavors of BLAST

Program   Query                               Database
blastn    Nucleotide                          Nucleotide
blastp    Protein                             Protein
blastx    Nucleotide, translated (6 frames)   Protein
tblastn   Protein                             Nucleotide, translated (6 frames)
tblastx   Nucleotide, translated (6 frames)   Nucleotide, translated (6 frames)

BLAST Selection Matrix

Nucleotide vs. Protein BLAST

Comparing ADSS from H. sapiens and A. thaliana

            aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
H.sapiens:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
             +  +  V  +     V  L  G     Q  W  G  D  E  G
A.thaliana:  S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
            agtcaagtatctggtgtactcggttgccaatggggagatgaaggt

BLASTp finds three matching words
BLASTn finds no match, because there are no 7 bp words

Protein searches are generally more sensitive than nucleotide searches.
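The ADSS example can be reproduced with a short script that translates both DNA fragments and compares them at each level. This is a minimal sketch: the codon table is the standard genetic code, the comparison is ungapped and position by position, and the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str, frame: int = 0) -> str:
    """Translate one forward reading frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def identities(a: str, b: str) -> int:
    """Number of identical positions in an ungapped, same-length comparison."""
    return sum(x == y for x, y in zip(a, b))


def longest_identical_run(a: str, b: str) -> int:
    """Longest stretch of consecutive identical positions (ungapped)."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


human_dna = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
plant_dna = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
human_prot = translate(human_dna)
plant_prot = translate(plant_dna)

print(human_prot, plant_prot)
print("protein identities:", identities(human_prot, plant_prot), "of", len(human_prot))
print("longest identical DNA run:", longest_identical_run(human_dna, plant_dna))
```

The longest run of identical aligned bases is shorter than the minimum nucleotide word size of 7, which is why a nucleotide search has no exact word to seed from here, while the translated sequences still share conserved residues.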

Choosing The Right BLAST  Flavor for Proteins

What you want to do: Find out something about the function of the protein
The right BLAST flavor: Use blastp to compare your protein with other proteins contained in the databases.

What you want to do: Discover new genes encoding similar proteins
The right BLAST flavor: Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading frames.

Claverie & Notredame 2003

Choosing the Right BLAST  Flavor for DNA

Question: Am I interested in non-coding DNA?
Answer: Yes, use blastn. Remember: blastn is only for closely related DNA sequences (more than 70% identical).

Question: Do I want to discover new proteins?
Answer: Yes, use tblastx.

Question: Do I want to discover proteins encoded in my query DNA sequences?
Answer: Yes, use blastx.

Question: Am I unsure of the quality of my DNA?
Answer: Yes, use blastx, especially if you suspect your DNA sequence codes for a protein but may contain sequencing errors.


Choosing The Right BLAST Flavor  for DNA Sequences

Usage                            Query            Database         Program
Find very similar DNA sequence   DNA              DNA              blastn
Protein discovery and ESTs       Translated DNA   Translated DNA   tblastx
Analysis of query DNA sequence   Translated DNA   Protein          blastx
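The selection tables above amount to a small lookup keyed on the query type, the database type, and whether a nucleotide-versus-nucleotide comparison should be done at the protein level. The sketch below is only an illustration of that decision; `choose_program` and its argument names are hypothetical.

```python
# (query type, database type, compare translated on both sides?) -> program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", False): "blastx",    # query translated
    ("protein", "nucleotide", False): "tblastn",   # database translated
    ("nucleotide", "nucleotide", True): "tblastx",  # both translated
}


def choose_program(query: str, database: str, compare_as_protein: bool = False) -> str:
    """Return the BLAST program name for a query/database combination."""
    key = (query, database, compare_as_protein)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for combination {key}")
    return BLAST_PROGRAMS[key]


print(choose_program("nucleotide", "protein"))
print(choose_program("nucleotide", "nucleotide", compare_as_protein=True))
```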


Some WWW-BLAST Databases

Nucleotide

•  nr (nt)
   –  Traditional gb divisions
   –  NM_ and XM_ RefSeqs
•  est
•  htgs
•  gss
•  wgs
•  chromosome
   –  NC_ RefSeqs
•  env_nt
•  month

Protein

•  nr (non-redundant sequences)
   –  GenBank CDS translations
   –  NP_ RefSeqs
   –  PIR, Swiss-Prot, PRF
   –  PDB (sequences from structures)
•  swissprot
•  pat - patents
•  pdb - sequences with 3D structures
•  env_nr - environmental samples
•  month - sequences updated within the past 30 days

Scoring Systems - Nucleotides

Identity matrix

      A    G    C    T
A    +1   –3   –3   –3
G    –3   +1   –3   –3
C    –3   –3   +1   –3
T    –3   –3   –3   +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||      raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
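The raw score in this example can be recomputed directly from the aligned rows. A minimal sketch, assuming a gap costs 3 per position (which is what the slide's "19-9" implies for two mismatches and one gap); `raw_score` is a hypothetical helper.

```python
def raw_score(top: str, bottom: str, match: int = 1,
              mismatch: int = -3, gap: int = -3) -> int:
    """Sum column scores of a gapped nucleotide alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # assumed gap cost, inferred from the slide's arithmetic
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


score = raw_score("CAGGTAGCAAGCTTGCATGTCA",
                  "CACGTAGCAAGCTTG-GTGTCA")
print(score)
```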

Scoring Systems - Proteins Position Independent Matrices

PAM Matrices (Percent Accepted Mutation)
•  Derived from observation; small dataset of alignments
•  Implicit model of evolution
•  All calculated from PAM1
•  PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)
•  Derived from observation; large dataset of highly conserved blocks
•  Each matrix derived separately from blocks with a defined percent identity cutoff
•  BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST

PAM250-Matrix

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. in Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.

BLOSUM62 Substitution Matrix

Common amino acids have low weights

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

Matrix differences

BLOSUM                                                PAM
Built from local alignments                           Built from global alignments
Built from vast amount of data                        Built from small amount of data
Based on groups of related sequences counted as one   Based on minimum replacement or maximum parsimony
Better for finding local alignments                   Better for finding global alignments and remote homologs
Lower BLOSUM series means more divergence             Higher PAM series means more divergence

Matrices - Rules of thumb

Need different levels of sensitivity?
   –  Close relationships: low PAM number (e.g. PAM 1) or high BLOSUM number (e.g. BLOSUM 80)
   –  Distant relationships: high PAM number (e.g. PAM 250) or low BLOSUM number (e.g. BLOSUM 45)

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments).

[Figure: distribution of alignment scores; x-axis: Score, y-axis: Alignments]

Expect Value: E = number of database hits you expect to find by chance

$E = K m n \, e^{-\lambda S}$

$E = m n \, 2^{-S'}$

K = scale for search space
λ = scale for scoring system
S = your (raw) score
S' = bit score = $(\lambda S - \ln K)/\ln 2$
m n = search space (query length × size of database)
E = expected number of random hits

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
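The two forms of the E-value formula can be checked numerically. This is a minimal sketch: the values of λ and K, the query length and the database size below are illustrative assumptions, not parameters taken from these slides, and the function names are hypothetical.

```python
import math


def bit_score(raw: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def evalue_from_raw(raw: float, m: int, n: int, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def evalue_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2 ** (-S')"""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumptions for this sketch).
lam, k = 0.318, 0.134          # scoring-system parameters
m, n = 250, 50_000_000         # query length, database length in residues
raw = 92

bits = bit_score(raw, lam, k)
print(f"bit score: {bits:.1f}")
print(f"E from raw score: {evalue_from_raw(raw, m, n, lam, k):.3g}")
print(f"E from bit score: {evalue_from_bits(bits, m, n):.3g}")
```

Because the bit score already absorbs λ and K, E-values computed from bit scores depend only on the search space. Doubling the database size doubles E for the same alignment.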

Protein BLAST page

>sorting nexin 18 [Homo sapiens]
MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASVQVIRAPEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPAS

Choose your database

Accession, GI, or sequence

Advanced BLAST Options

Matrix Selection (protein)
• PAM30 -- most stringent
• BLOSUM45 -- least stringent

Example Entrez Queries
green plants[Organism]
srcdb_refseq[Properties]
biomol_mrna[Properties]
proteins_all[Filter] NOT mammalia[Organism]

Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments

Limit by taxon
Mus musculus[Organism]
Mammalia[Organism]
Viridiplantae[Organism]

Expect Value Cut-off

sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
Length = 414
Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)

Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
S++S SSS+S SS + + ++S + + S S S+ + E K
Sbjct: 29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88

Filtered Unfiltered

Low Complexity Filtering

Homology Search: Sorting Nexin 18

>Notch homolog 3 [Homo sapiens]
MGPGARGRRRRRRPMSPPPPPPPVRALPLLLLLAGPGAAAPPCLDGSPCANGGRCTQLPSREAACLCPPGWVGERCQLEDPCHSGPCAGRGVCQSSVVAGTARFSCRCPRGFRGPDCSLPDPCLSSPCAHGARCSVGPDGRFLCSCPPGYQGRSCRSDVDEC

>sorting nexin 18 [Homo sapiens]
MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASVQVIRAPEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPAS

BLAST Formatting Page

Results of a Conserved Domain Search

>gi|20140146|sp|Q96RF0|SNXI_HUMAN Sorting nexin 18
Length = 628
Score = 1048 bits (2710), Expect = 0.0
Identities = 528/628 (84%), Positives = 528/628 (84%)

Query: 1  MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60
          MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR
Sbjct: 1  MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60

Query: 61 XXXXXXXXXXXXXXXXXXXNVPPGGFEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSTFQ 120
                             NVPPGGFE                             STFQ
Sbjct: 61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120
. . .

Low Complexity Filter

low complexity sequence

Query: 61 apepgpagdggpgaparyaNVPPGGFEplpvappasfkpppdafqallqpqqapppSTFQ 120
          APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ
Sbjct: 61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120

BLAST Formatting Page

BLAST Output Overview

• Graphic Display: Shows you where your query is similar to other sequences.
• Hit List: Name of sequences similar to your query ranked by similarity
• Alignments: Every alignment between your query and the reported hits
• Parameters: List of the various parameters used for the search

BLAST Output: Graphic Overview

SH3 PX

Results of the CD Search:

Red bar = most similar sequence
Pink = almost as similar
Green = even less similar
Blue/Black = worse scores

mouse over, click for active links

BLAST Output: Descriptions

•  Links to Sequence Records
•  E-value example: 4 × 10^-68 (Expect Value Cut-off default = 10)
•  Link icons: L = LocusLink, U = UniGene, G = GEO, S = Structure
•  Bit scores < 50 unreliable
•  link to entrez

A Little on Interpretation

• How similar must sequences be in order to be considered homologous?

• Rule of thumb: more than 25% of the amino acids are identical for proteins, and more than 70% of the nucleotides are identical for DNA. Above these limits, you can generally assume that two proteins have a similar structure and a common ancestor.

• Remember: this applies only to alignments longer than 100 aa or nt.

A Little on Interpretation: E-value

•  E-values (Expectation Values) determine how much you can trust your conclusion on homology.

•  They allow comparison of pairwise alignments with different similarities and different lengths, which is an advantage over percent identity (not discussed).

•  Definition: the number of times a database match with this score is expected to occur by chance. A match unlikely to occur by chance is a good match, so the lowest E-values (as close to 0 as possible) are the best and the most significant: we can trust them enough to infer homology.

•  If you want to be certain of homology, your E-values must be below 10^-4 (0.0001).
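The E-value cut-off can be applied mechanically to a hit list. The sketch below uses figures from the example outputs in these slides; the `Hit` class and `is_significant` function are hypothetical names introduced for illustration, and the hits come from different example searches.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    evalue: float
    identical: int      # identical positions in the alignment
    align_len: int      # alignment length

    @property
    def percent_identity(self) -> float:
        return 100.0 * self.identical / self.align_len


def is_significant(hit: Hit, max_evalue: float = 1e-4) -> bool:
    """Apply the slide's rule of thumb: E-value below 10^-4."""
    return hit.evalue < max_evalue


hits = [
    Hit("SNX9_HUMAN", 4e-68, 140, 322),
    Hit("MLH1_HUMAN", 8e-62, 117, 131),
    Hit("NSR1_YEAST", 0.013, 35, 131),
]
significant = [h.name for h in hits if is_significant(h)]
for h in hits:
    print(f"{h.name}: E={h.evalue:g}, identity={h.percent_identity:.0f}%, "
          f"significant={is_significant(h)}")
```

The NSR1_YEAST hit is above 25% identity over more than 100 residues, yet its E-value fails the cut-off, which illustrates why the E-value is preferred over percent identity alone.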

Results From “nr” (protein)

non-redundant set

BLAST Output: Pairwise Alignments

>gi|12643956|sp|Q9Y5X1|SNX9_HUMAN Sorting nexin 9 (SH3 and PX domain-containing protein 1) (SDP1 protein)

Length = 595

Score = 255 bits (652), Expect = 4e-68

Identities = 140/322 (43%), Positives = 185/322 (56%), Gaps = 7/322 (2%)

Query: 221 SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI 280

SS+++ LN+F F K G E ++L A K +K+ +++G YGP W F C +

Sbjct: 197 SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV 254

Query: 281 DDPTKQTKFKGMKSYISYKLVPTHTQVPVHRRYKHFDWLYARLAEKF-PVISVPHLPEKQ 339

DP K +K G+KSYI Y+L PT+T V+ RYKHFDWLY RL KF I +P LP+KQ

Sbjct: 255 ADPRKGSKMYGLKSYIEYQLTPTNTNRSVNHRYKHFDWLYERLLVKFGSAIPIPSLPDKQ 314

Query: 340 ATGRFEEDFISKRRKGLIWWMNHMASHPVLAQCDVFQHFLTCPSSTDEKAWKQGKRKAEK 399

TGRFEE+FI R + L WM M HPV+++ +VFQ FL + DEK WK GKRKAE+

Sbjct: 315 VTGRFEEEFIKMRMERLQAWMTRMCRHPVISESEVFQQFL---NFRDEKEWKTGKRKAER 371

VH V+ VN

BLAST Output: Alignments

>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1)
Length = 756
Score = 233 bits (593), Expect = 8e-62
Identities = 117/131 (89%), Positives = 117/131 (89%)

Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335

Query: 61  GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
           GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395

Query: 121 FLQPLSKPLSS 131
           FLQPLSKPLSS
Sbjct: 396 FLQPLSKPLSS 406

low complexity sequence filtered

Genomic BLAST

•  The Genome BLAST pages, linked from the BLAST homepage, provide customized nucleotide and protein databases for each genome.

•  If a Map Viewer is available, the BLAST hits can be viewed on the maps.

Identify your sequence with BLAST

BG743989

Human EST from a Natural Killer Cell Culture:

DEFINITION: 602722761F1 NIH_MGC_106 Homo sapiens cDNA clone IMAGE:4849239 5', mRNA sequence.

BLAST Tips

• It is faster and more accurate to BLAST proteins (blastp) rather than nucleotides.
• If in doubt use blastp.
• When possible restrict to the subset of the database you are interested in.
• Look around for the database you need or create your own custom BLAST database. BUT HOW???
• When is the best time to use the BLAST server?
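One possible answer to "BUT HOW???", sketched under stated assumptions: with the standalone NCBI BLAST+ package installed, a custom database is built with `makeblastdb` and searched with `blastp`. The file and database names below are hypothetical placeholders, and the wrapper functions are illustrative helpers, not part of BLAST+.

```python
import shutil
import subprocess


def makeblastdb_command(fasta: str, dbtype: str, out: str) -> list[str]:
    """Command line that formats a FASTA file as a BLAST database."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", out]


def blastp_command(query: str, db: str, evalue: float = 1e-4,
                   out: str = "hits.tsv") -> list[str]:
    """Command line for a protein search with tabular output (-outfmt 6)."""
    return ["blastp", "-query", query, "-db", db,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out]


def run(cmd: list[str]) -> bool:
    """Run the command if the program is installed; otherwise just show it."""
    if shutil.which(cmd[0]) is None:
        print("not installed, would run:", " ".join(cmd))
        return False
    subprocess.run(cmd, check=True)
    return True


# Hypothetical file names; nothing is executed here, the commands are only printed.
build = makeblastdb_command("my_proteins.fasta", "prot", "my_proteins_db")
search = blastp_command("query.fasta", "my_proteins_db")
print(" ".join(build))
print(" ".join(search))
```

Note that the `-e`, `-v` and `-b` options shown under "Other Advanced" belong to the older command-line interface; in BLAST+ the corresponding options are `-evalue`, `-num_descriptions` and `-num_alignments`.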
