NCBI FieldGuide
NCBI Molecular Biology Resources
January 2008 Peter Cooper
Using NCBI BLAST

Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation
• WWW, standalone, and network client

What BLAST tells you
• BLAST reports surprising alignments
– Different than chance
• Assumptions
– Random sequences
– Constant composition
• Conclusions
– Surprising similarities imply evolutionary homology
Evolutionary homology: descent from a common ancestor
Does not always imply similar function

BLAST and BLAST-like programs
• Traditional BLAST (blastall): nucleotide, protein, translations
– blastn: nucleotide query vs. nucleotide database
– blastp: protein query vs. protein database
– blastx: nucleotide query vs. protein database
– tblastn: protein query vs. translated nucleotide database
– tblastx: translated query vs. translated database
• Megablast: nucleotide only
– Contiguous megablast
  • Nearly identical sequences
– Discontiguous megablast
  • Cross-species comparison
• Position Specific BLAST programs: protein only
– Position Specific Iterative BLAST (PSI-BLAST)
  • Automatically generates a position specific score matrix (PSSM)
– Reverse PSI-BLAST (RPS-BLAST)
  • Searches a database of PSI-BLAST PSSMs
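
The query/database pairings of the traditional programs amount to a small lookup table. A minimal Python sketch follows; the function name `choose_program`, the type labels `"nucleotide"`/`"protein"`, and the `translate_both` flag are illustrative assumptions, not part of any NCBI tool.

```python
# Traditional BLAST programs keyed by (query type, database type).
# blastx translates the nucleotide query; tblastn translates the database.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, db_type, translate_both=False):
    """Return the program name for a query/database pairing (hypothetical helper)."""
    if translate_both:
        # tblastx compares translated query against translated database,
        # so both sides must be nucleotide sequences.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return PROGRAMS[(query_type, db_type)]


print(choose_program("protein", "nucleotide"))
print(choose_program("nucleotide", "nucleotide", translate_both=True))
```
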

Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution
[Figure: number of alignments plotted against score]
(applies to ungapped alignments)
$E = K m n\, e^{-\lambda S}$ or $E = m n\, 2^{-S'}$
$K$ = scale for search space
$\lambda$ = scale for scoring system
$S'$ = bit score = $(\lambda S - \ln K)/\ln 2$
Expect Value
E = number of database hits you expect to find by chance
[Figure labels: size of database; your score; expected number of random hits]
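
The two forms of the Expect value, E = K·m·n·e^(−λS) and E = m·n·2^(−S'), can be checked numerically. The sketch below is only an illustration: it reads m as the query length and n as the database size (the usual interpretation), and the values of λ and K, the query length, and the database size are assumed example numbers, not values taken from these slides.

```python
import math

# Assumed example parameters for an ungapped scoring system (illustrative only;
# real searches use values that depend on the matrix and gap costs).
LAMBDA = 0.3176
K = 0.134


def bit_score(raw_score, lam=LAMBDA, k=K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


m, n = 756, 100_000_000   # hypothetical query length and database size
raw = 100                 # hypothetical raw alignment score

e_raw = expect_from_raw(raw, m, n)
e_bits = expect_from_bits(bit_score(raw), m, n)
print(e_raw, e_bits)      # the two forms agree up to rounding
```

Because the bit score folds λ and K into the score itself, E values computed from bit scores are comparable across scoring systems.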

Scoring Systems
• Position Independent Matrices
– Nucleic Acids: identity matrix
– Proteins
  • PAM Matrices (Percent Accepted Mutation)
    – Implicit model of evolution
    – Higher PAM numbers all calculated from PAM1
    – PAM250 widely used
  • BLOSUM Matrices (BLOck SUbstitution Matrices)
    – Empirically determined from alignment of conserved blocks
    – Each includes information up to a certain level of identity
    – BLOSUM62 widely used
• Position Specific Score Matrices (PSSMs)
– PSI and RPS BLAST
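
For nucleic acids, an identity matrix simply rewards matches and penalizes mismatches, and the score of an ungapped alignment is the sum over aligned positions. A minimal sketch, in which the reward and penalty values (+1 and −3) and the function names are assumptions chosen for illustration:

```python
def identity_matrix(match=1, mismatch=-3, alphabet="ACGT"):
    """Build a position-independent identity matrix as a dict of base pairs."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def ungapped_score(seq1, seq2, matrix):
    """Sum matrix scores column by column for two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


matrix = identity_matrix()
# Seven matching positions and one mismatch: 7 * 1 + (-3) = 4
print(ungapped_score("ACGTACGT", "ACGTTCGT", matrix))
```

Protein matrices such as PAM and BLOSUM work the same way at scoring time; only the table of pair scores differs.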

WWW BLAST Interface

The BLAST homepage
www.ncbi.nlm.nih.gov/blast

Basic BLAST: Databases

BLAST Databases: Non-redundant protein
nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_, XP_ RefSeqs
– Outside Protein
  • PIR, Swiss-Prot, PRF
  • PDB (sequences from structures)
pat: protein patents
env_nr: environmental samples
Services: blastp, blastx

Nucleotide Databases: Human and Mouse
• Human and mouse genomic and transcript now default
• Separate sections in output for mRNA and genomic
• Direct links to Map Viewer for genomic sequences
Megablast, blastn service

Nucleotide Databases: Traditional
Services: blastn, tblastn, tblastx
• nr (nt)
– Traditional GenBank
– NM_ and XM_ RefSeqs
• refseq_rna
• refseq_genomic
– NC_ RefSeqs
• dbest
– EST Division
• est_human, mouse, others
• htgs
– HTG division
• gss
– GSS division
• wgs
– whole genome shotgun
• env_nt
– environmental samples
Databases are mostly non-overlapping

Basic BLAST: Protein Searches

Universal Form: Protein

BLAST and Molecular Evolution
[Figure: divergence times (3000 Myr, 1000 Myr, 540 Myr) for Yeast, Bacteria, Worm, Fly, Human, with human disease genes: Alzheimer's disease, ataxia telangiectasia, colon cancer, pancreatic carcinoma]
MLH1, MutL

Protein BLAST Page

Limiting Database: Organism
Organism autocomplete

Limiting Database: Entrez Query
all[filter] NOT mammals[organism]
gene_in_mitochondrion[Properties]
2006:2007[Modification Date]
Nucleotide
biomol_mrna[Properties]
biomol_genomic[Properties]
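
Entrez limits like these are plain strings of `term[Field]` pieces joined by Boolean operators, so they are easy to assemble programmatically. A small sketch; the helper names `field` and `limit_query` are hypothetical, and the combination in the second example is only an illustration of joining two of the terms shown above.

```python
def field(term, tag):
    """Format one Entrez term with its field tag, e.g. mammals[organism]."""
    return f"{term}[{tag}]"


def limit_query(include, exclude=()):
    """Join included terms with AND, then append NOT clauses for excluded terms."""
    query = " AND ".join(include)
    for term in exclude:
        query += f" NOT {term}"
    return query


everything_but_mammals = limit_query(
    [field("all", "filter")],
    exclude=[field("mammals", "organism")],
)
recent_mrna = limit_query([
    field("biomol_mrna", "Properties"),
    field("2006:2007", "Modification Date"),
])

print(everything_but_mammals)
print(recent_mrna)
```

The resulting string is what would be pasted into the Entrez Query box on the search form.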

Run Search

BLAST Formatting Page
Conserved Domain Results

BLAST Output: Graphical Overview
Mouse over
Sort by taxonomy

BLAST Output: Descriptions
Link to Entrez
Sorted by E values
5 × 10^-14
Default E value cutoff 10
Gene Linkout

TaxBLAST: Taxonomy Reports

BLAST Output: Alignments
Identical match
Positive score (conservative)
Negative or zero
Gap
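
The four cases labeled on the alignment (identical match, positive score, negative or zero score, gap) determine what appears on the middle line between query and subject. A sketch of that rule, assuming the usual display of a letter for identities, `+` for positive scores, and a space otherwise; the score table holds only a few illustrative pair scores and is not a full substitution matrix.

```python
# Illustrative pair scores only (an assumption, not a complete matrix).
SCORES = {
    ("L", "I"): 2,    # conservative substitution, positive score
    ("K", "R"): 2,
    ("G", "W"): -2,   # non-conservative substitution, negative score
}


def pair_score(a, b):
    """Look up a substitution score regardless of residue order."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def midline(query, subject):
    """Build the line shown between two aligned sequences."""
    out = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            out.append(" ")          # gap column
        elif q == s:
            out.append(q)            # identical match: show the residue
        elif pair_score(q, s) > 0:
            out.append("+")          # positive score (conservative)
        else:
            out.append(" ")          # negative or zero score
    return "".join(out)


query, subject = "LKG-A", "IKWDA"
print(query)
print(midline(query, subject))
print(subject)
```
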

Position Specific Iterative BLAST

MLH1 and ETR1
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1 [Homo sapiens]
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIVKEGGLKLIQIQDNGTGIRK
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPK
PCAGNQGTQITVEDLFYNIATRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNA
STVDNIRSIFGNAVSRELIEIGCEDKTLAFKMNGYISNANYSVKKCIFLLFINHRLVESTSLRKAIETVY
AAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLLGSNSSRMYFTQTLLP
GLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSSQPQAIVTEDKTDIS
SGRARQQDEEMLELPAPAEVAAKNQSLEGDTTKGTSEMSEKRGPTSSNPRKRHREDSDVEMVEDDSRKEM
TAACTPRRRIINLTSVLSLQEEINEQGHEVLREMLHNHSFVGCVNPQWALAQHQTKLYLLNTTKLSEELF
YQILIYDFANFGVLRLSEPAPLFDLAMLALDSPESGWTEEDGPKEGLAEYIVEFLKKKAEMLADYFSLEI
DEEGNLIGLPLLIDNYVPPLEGLPIFILRLATEVNWDEEKECFESLSKECAMFYSIRKQYISEESTLSGQQSEVPGSIPNSWKWTVEHIVYKALRSHILPPKHFTEDGNILQLANLPDLYKVFERC
>gi|22095656|sp|O81122.1|ETR1_MALDO Ethylene receptor
MLACNCIEPQWPADELLMKYQYISDFFIALAYFSIPLELIYFVKKSAVFPYRWVLVQFGAFIVLCGATHLINLWTFSIHSRTVAMVMTTAKVLTAVVSCATALMLVHIIPDLLSVKTRELFLKNKAAELDREMGLIRTQEETGRHVRMLTHEIRSTLDRHTILKTTLVELGRTLALEECALWMPTRTGLELQLSYTLRQQNPVGYTVPIHLPVINQVFSSNRAVKISANSPVAKLRQLAGRHIPGEVVAVRVPLLHLSNFQINDWPELSTKRYALMVLMLPSDSARQWHVHELELVEVVADQVAVALSHAAILEESMRARDLLMEQNIALDLARREAETAIRARNDFLAVMNHEMRTPMHAIIALSSLLQETELTAEQRLMVETILRSSNLLATLINDVLDLSRLEDGSLQLEIATFNLHSVFREVHNMIKPVASIKRLSVTLNIAADLPMYAIGDEKRLMQTILNVVGNAVKFSKEGSISITAFVAKSESLRDFRAPDFFPVQSDNHFYLRVQVKDSGSGINPQDIPKLFTKFAQTQALATRNSGGSGLGLAICKRFVNLMEGHIWIESEGLGKGCTATFIVKLGFPERSNESKLPFAPKLQANHVQTNFPGLKVLVMDDNGVSRSVTKGLLAHLGCDVTAVSLIDELLHVISQEHKVVFMDVSMPGIDGYELAVRIHEKFTKRHERPVLVALTGSIDKITKENCMRVGVDGVILKPVSVDKMRSVLSELLEHRVLFEAM
Human Mismatch Repair Protein
Apple ethylene receptor

PSI-BLAST: Iteration 1

PSI-BLAST: Iteration 4
Plant ethylene receptors, bacterial two-component regulatory system kinases

RPS-BLAST: Conserved Domains
Histidine Kinase-like ATPase Domain

Algorithm parameters: Protein
Adjust to set stringency
May limit results
Default statistics adjustment for compositional bias
Off now by default. Conflicts with comp-based stats
Expand

Automatic Short Sequence Adjustment
• E value: 20000
• Word Size: 2
• Matrix: PAM30
• Comp Stats: Off
• Low Comp Filter: Off
Nucleotide and Protein
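
The short-sequence adjustment can be pictured as a set of overrides applied on top of whatever parameters the form would otherwise use. In the sketch below the override values are the ones listed on the slide; the parameter key names, the starting parameters, and the length threshold of 30 residues are assumptions made for illustration, not documented settings.

```python
# Overrides listed on the slide; key names are hypothetical.
SHORT_QUERY_OVERRIDES = {
    "evalue": 20000,
    "word_size": 2,
    "matrix": "PAM30",
    "comp_based_stats": False,
    "low_complexity_filter": False,
}

# Assumed threshold for what counts as a "short" query (illustration only).
SHORT_QUERY_MAX_LENGTH = 30


def adjust_for_short_query(params, query):
    """Return a copy of params with short-query overrides applied when needed."""
    adjusted = dict(params)
    if len(query) < SHORT_QUERY_MAX_LENGTH:
        adjusted.update(SHORT_QUERY_OVERRIDES)
    return adjusted


# Hypothetical starting parameters for a protein search.
params = {"evalue": 10, "matrix": "BLOSUM62", "comp_based_stats": True}
print(adjust_for_short_query(params, "MSFVAGVIRRLDE"))
```
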

Basic BLAST: Nucleotide

Universal Form: Nucleotide
Speed: More → Less
Sensitivity: Less → More

Nucleotide Results: ALB mRNA
• megablast
• disco. megablast
• blastn

Nucleotide BLAST: Human Genome

Sortable Results
Pseudogene on Chromosome 9
Separate Sections for Transcript and Genome
Direct links to Entrez Databases
Functional Gene on Chromosome 1

Total Score: All Segments
Functional Gene Now First

Alignments: Sorting in Exon Order
Default sorting order: Score (longest exon usually first)
Query start position: exon order
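
The difference between the two sort orders, and the idea of a total score summed over all aligned segments of one hit, can be shown with a few lines of Python. The segment coordinates and scores below are made-up values for illustration, and the `Hsp` tuple is a hypothetical structure rather than a BLAST data type.

```python
from collections import namedtuple

# One aligned segment (e.g. one exon of an mRNA query aligned to a genomic hit).
Hsp = namedtuple("Hsp", "query_start query_end score")

# Made-up segments for a single genomic hit.
hsps = [
    Hsp(401, 900, 820.0),
    Hsp(1, 150, 250.0),
    Hsp(151, 400, 410.0),
]

# Default order: highest score first, so the longest exon usually leads.
by_score = sorted(hsps, key=lambda h: h.score, reverse=True)

# Sorting on query start position puts the segments in exon order.
by_query_start = sorted(hsps, key=lambda h: h.query_start)

# Total score: the sum over all segments of the hit.
total_score = sum(h.score for h in hsps)

print([h.query_start for h in by_score])
print([h.query_start for h in by_query_start])
print(total_score)
```

Ranking hits by total score rather than by best single segment is what lets a multi-exon functional gene outrank a pseudogene that aligns in one long piece.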

Links to Map Viewer
Chromosome 1
Chromosome 9

Algorithm parameters: Nucleotide
blastn
• Masks species-specific interspersed repeats
• Essential for genomic query sequences
• Prevents starting alignment in masked region
• Allows extensions through masked regions
Masks low-complexity (LC) sequence (simple repeats)

BLAST Formatting Options

Protein Formatting Page
Show
• Alignment
• PSSM
• PssmWithParameters
• Bioseq
as
• HTML
• Plain Text
• ASN.1
• XML
Alignment View
• Pairwise
• Pairwise with dots for identities
• Query-anchored with dots for identities
• Query-anchored with letters for identities
• Flat query-anchored with dots for identities
• Flat query-anchored with letters for identities
• Hit table

Structured formats: XML and ASN.1
XML
<Iteration_hits>
  <Hit>
    <Hit_num>1</Hit_num>
    <Hit_id>gi|730028|sp|P40692|MLH1_HUMAN</Hit_id>
    <Hit_def>DNA mismatch repair protein Mlh1 (MutL protein homolog 1)</Hit_def>
    <Hit_accession>P40692</Hit_accession>
    <Hit_len>756</Hit_len>
    <Hit_hsps>
      <Hsp>
        <Hsp_num>1</Hsp_num>
        <Hsp_bit-score>1568.9</Hsp_bit-score>
        <Hsp_score>4061</Hsp_score>
        <Hsp_evalue>0</Hsp_evalue>
        <Hsp_query-from>1</Hsp_query-from>
        <Hsp_query-to>756</Hsp_query-to>
        <Hsp_hit-from>1</Hsp_hit-from>
        <Hsp_hit-to>756</Hsp_hit-to>
        <Hsp_query-frame>0</Hsp_query-frame>
        <Hsp_hit-frame>0</Hsp_hit-frame>
        <Hsp_identity>0</Hsp_identity>
        <Hsp_positive>0</Hsp_positive>
        <Hsp_gaps>0</Hsp_gaps>
        <Hsp_align-len>756</Hsp_align-len>
ASN.1
Seq-annot ::= {
  desc {
    user {
      type str "Hist Seqalign" ,
      data {
        { label str "Hist Seqalign" , data bool TRUE } } } ,
    user {
      type str "Blast Type" ,
      data {
        { label id 0 , data int 0 } } } ,
    user {
      type str "BLAST database title" ,
      data {
        { label str "Non-redundant SwissProt
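
Structured output is meant to be read by programs rather than people. As a rough sketch, the XML fragment shown on this slide can be parsed with Python's standard `xml.etree.ElementTree`; the fragment is abbreviated here and its closing tags are filled in so that it is well-formed, which is an assumption about the parts the slide cuts off.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the slide's fragment, with closing tags added.
XML_FRAGMENT = """
<Iteration_hits>
  <Hit>
    <Hit_num>1</Hit_num>
    <Hit_id>gi|730028|sp|P40692|MLH1_HUMAN</Hit_id>
    <Hit_accession>P40692</Hit_accession>
    <Hit_len>756</Hit_len>
    <Hit_hsps>
      <Hsp>
        <Hsp_num>1</Hsp_num>
        <Hsp_bit-score>1568.9</Hsp_bit-score>
        <Hsp_evalue>0</Hsp_evalue>
        <Hsp_align-len>756</Hsp_align-len>
      </Hsp>
    </Hit_hsps>
  </Hit>
</Iteration_hits>
"""

root = ET.fromstring(XML_FRAGMENT)

results = []
for hit in root.iter("Hit"):
    for hsp in hit.iter("Hsp"):
        # Collect one record per aligned segment of each hit.
        results.append({
            "accession": hit.findtext("Hit_accession"),
            "hit_length": int(hit.findtext("Hit_len")),
            "bit_score": float(hsp.findtext("Hsp_bit-score")),
            "evalue": float(hsp.findtext("Hsp_evalue")),
            "align_length": int(hsp.findtext("Hsp_align-len")),
        })

print(results)
```
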

The Hit Table
# BLASTP 2.2.17 (Aug-26-2007)
# Query: gi|4557757|ref|NP_000240.1| MutL protein homolog 1 [Homo sapiens]
# Database: swissprot
# Fields: query id, subject ids, % identity, % positives, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 80 hits found
ref|NP_000240.1||gi|4557757 gi|1709056|sp|P38920|MLH1_YEAST 36.68 56.91 796 426 18 8 756 5 769 7e-138 491
ref|NP_000240.1||gi|4557757 gi|48474996|sp|Q9P7W6|MLH1_SCHPO 37.24 54.04 768 371 16 8 756 9 684 8e-122 437
ref|NP_000240.1||gi|4557757 gi|25090753|sp|Q8RA70|MUTL_THETN 37.44 54.62 390 231 7 8 394 4 383 5e-59 229
ref|NP_000240.1||gi|4557757 gi|25090732|sp|Q8KAX3|MUTL_CHLTE 35.95 54.05 370 229 5 8 375 4 367 5e-55 215
ref|NP_000240.1||gi|4557757 gi|127552|sp|P23367.2|MUTL_ECOLI 35.99 58.11 339 202 7 8 334 3 338 8e-55 214
ref|NP_000240.1||gi|4557757 gi|29427778|sp|Q8FAK9|MUTL_ECOL6 35.99 58.11 339 202 7 8 334 3 338 1e-54 214
ref|NP_000240.1||gi|4557757 gi|20455084|sp|Q8XDN4|MUTL_ECO57 35.99 58.11 339 202 7 8 334 3 338 1e-54 214
ref|NP_000240.1||gi|4557757 gi|59798328|sp|Q72PF7|MUTL_LEPIC 36.27 55.20 375 221 8 6 375 2 363 3e-54 213
ref|NP_000240.1||gi|4557757 gi|13431695|sp|P57886|MUTL_PASMU 35.48 58.94 341 213 6 8 345 3 339 4e-54 212
ref|NP_000240.1||gi|4557757 gi|1171080|sp|P44494|MUTL_HAEIN 35.74 59.87 319 198 6 8 323 3 317 5e-54 212
ref|NP_000240.1||gi|4557757 gi|20455102|sp|Q8ZIW4|MUTL_YERPE 36.01 58.63 336 207 6 8 339 3 334 6e-54 212
ref|NP_000240.1||gi|4557757 gi|20455152|sp|Q9JYT2|MUTL_NEIMB 33.96 55.35 374 224 8 8 376 4 359 2e-53 210
ref|NP_000240.1||gi|4557757 gi|20139217|sp|Q9KAC1|MUTL_BACHD 35.39 55.90 356 214 6 8 362 4 344 2e-53 209
ref|NP_000240.1||gi|4557757 gi|31076794|sp|Q87L05|MUTL_VIBPA 35.33 58.38 334 210 5 8 338 3 333 3e-53 209
ref|NP_000240.1||gi|4557757 gi|20455150|sp|Q9JTS2|MUTL_NEIMA 36.94 58.28 314 183 5 8 316 4 307 5e-53 209
ref|NP_000240.1||gi|4557757 gi|56749233|sp|Q6GHD9|MUTL_STAAR 38.28 58.46 337 193 7 6 335 2 330 1e-52 207
ref|NP_000240.1||gi|4557757 gi|25090739|sp|Q8NWX9|MUTL_STAAW 38.28 58.46 337 193 7 6 335 2 330 1e-52 207
ref|NP_000240.1||gi|4557757 gi|71151979|sp|Q5HGD5|MUTL_STAAC 38.28 58.46 337 193 7 6 335 2 330 1e-52 207
ref|NP_000240.1||gi|4557757 gi|54037875|sp|P65492|MUTL_STAAN 38.28 58.46 337 193 7 6 335 2 330 2e-52 207
ref|NP_000240.1||gi|4557757 gi|20043258|sp|Q9KV13|MUTL_VIBCH 35.74 58.56 333 204 6 8 335 3 330 2e-52 207
ref|NP_000240.1||gi|4557757 gi|127553|sp|P14161|MUTL_SALTY 35.10 56.93 339 205 7 8 334 3 338 3e-52 206
ref|NP_000240.1||gi|4557757 gi|20455140|sp|Q9CDL1|MUTL_LACLA 36.31 56.55 336 196 5 6 334 2 326 4e-52 206
ref|NP_000240.1||gi|4557757 gi|61214242|sp|Q7MH01|MUTL_VIBVY 34.63 58.51 335 213 5 8 339 3 334 4e-52 206
ref|NP_000240.1||gi|4557757 gi|20455099|sp|Q8Z187|MUTL_SALTI 35.10 56.93 339 205 7 8 334 3 338 4e-52 206
ref|NP_000240.1||gi|4557757 gi|31076809|sp|Q8DCV0|MUTL_VIBVU 34.63 58.51 335 213 5 8 339 3 334 6e-52 205
ref|NP_000240.1||gi|4557757 gi|71648717|sp|Q5E2C6|MUTL_VIBF1 36.71 59.81 316 186 6 8 316 3 311 1e-51 204
ref|NP_000240.1||gi|4557757 gi|37999611|sp|Q88DD1|MUTL_PSEPK 30.34 48.97 435 278 7 8 419 7 439 2e-51 203
Importable into spreadsheets
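
Besides spreadsheets, the hit table is easy to read from a script. The sketch below parses three of the rows shown above using the field order given in the `# Fields:` line; the Python field names, the `parse_hit_table` helper, and the E value threshold are illustrative choices, and splitting on whitespace is an assumption that holds for these rows.

```python
# Column names following the "# Fields:" comment line of the hit table.
FIELDS = [
    "query_id", "subject_ids", "pct_identity", "pct_positives",
    "alignment_length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
]
FLOAT_FIELDS = {"pct_identity", "pct_positives", "evalue", "bit_score"}

# Three rows copied from the slide.
HIT_TABLE = """\
# BLASTP 2.2.17 (Aug-26-2007)
# Database: swissprot
ref|NP_000240.1||gi|4557757 gi|1709056|sp|P38920|MLH1_YEAST 36.68 56.91 796 426 18 8 756 5 769 7e-138 491
ref|NP_000240.1||gi|4557757 gi|127552|sp|P23367.2|MUTL_ECOLI 35.99 58.11 339 202 7 8 334 3 338 8e-55 214
ref|NP_000240.1||gi|4557757 gi|37999611|sp|Q88DD1|MUTL_PSEPK 30.34 48.97 435 278 7 8 419 7 439 2e-51 203
"""


def parse_hit_table(text):
    """Turn hit-table text into a list of dicts, skipping '#' comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split()))
        for name in FIELDS[2:]:
            # Numeric columns: floats for percentages and scores, ints otherwise.
            row[name] = float(row[name]) if name in FLOAT_FIELDS else int(row[name])
        hits.append(row)
    return hits


hits = parse_hit_table(HIT_TABLE)
# Keep only hits below an arbitrary example E value threshold.
strong = [h["subject_ids"] for h in hits if h["evalue"] < 1e-54]
print(strong)
```
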

PSSMs: Restart PSI-BLAST
ASCII encoded, Web only
ASN.1 ScoreMat, Portable

BLAST TreeView
Black bear mt genome vs. RefSeq Genomic

Distance Tree: Carnivore Mitochondrial Genome
[Tree labels: bears, walrus, fur seal, sea lions, true seals, dogs, mongooses, cats, red panda, weasels, raccoon]

Managing Searches
Recent Results
Saved Strategies

Recent Results
Log in to My NCBI to save search strategies
Results available for 36 hours

Saved Strategies
Re-run searches to keep up to date

Genome and Specialized BLAST

Genome BLAST pages

Map Viewer Homepage

Poplar Genome BLAST

tblastn Genome BLAST Results
Protein-nucleotide alignments
Exons and genes mixed

Genomic Context of BLAST Hits

Hits in Map Viewer

Specialized BLAST Pages

Service Addresses
• General Help: [email protected]
• BLAST: [email protected]
Telephone support: 301-496-2475