Introduction to Bioinformatics: BLAST
Post on 22-Dec-2015
BLAST
• Introduction
  – What is BLAST?
  – Query Sequence Formats
  – What does BLAST tell you?
• Choices
  – Variety of BLAST
  – BLAST Programs: Which One to Use?
  – Commonly Used BLAST Programs
  – BLAST Databases: Which One to Search?
• Understanding the Output
• Database Search with BLAST
• BLAST Steps – How It Works
Acknowledgement: The presentation includes adaptations from NCBI’s Introduction to Molecular Biology Information Resources modules.
What is BLAST?
• Basic Local Alignment Search Tool
• The Google™ of bioinformatics
• Query is a DNA or protein sequence, not a text term
• Character string comparison against all the sequences in the target database
• Rigorous statistics used to identify statistically significant matches
Query Sequence Formats
• Bare sequence
  – QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP
  –   1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
     61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
    121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
    181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
• Identifiers
  – accession, accession.version, or GI numbers
  – e.g., p01013, AAA68881.1, 129295, gi|129295
• FASTA format
Query Sequence in FASTA Format
• FASTA definition line ("def line") that begins with a >, followed by some text that briefly describes the query sequence on a single line
• Up to 80 nucleotide bases or amino acids per line
• Blank lines not allowed in the middle
• Example
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
• Additional information
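The FASTA rules above (a ">" def line, then sequence lines) can be sketched as a small parser. This is a minimal illustration, not NCBI's own parsing code; the function name `parse_fasta` is made up for this example, and it skips blank lines rather than rejecting them.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (def_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skipped here for leniency; BLAST queries should not contain blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the leading ">"
        elif header is None:
            raise ValueError("sequence data found before a '>' definition line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
"""
records = parse_fasta(example)
def_line, sequence = records[0]
print(def_line.split("|")[1], len(sequence))
```

The sequence lines are joined into one string, so the result is the same "bare sequence" form shown on the previous slide.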
What does BLAST tell you?
• Putative identity and function of your query sequence
• Helps to direct experimental design to prove the function
• Find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene
• Compare complete genomes against each other to identify similarities and differences among organisms
BLAST Programs: Which One to Use?
Depends on:
• What type of query sequence you have (nucleotide or protein)
• What type of database you will search against (nucleotide or protein)
• BLAST program descriptions
  – brief list
  – BLAST program selection guide
Commonly Used BLAST Programs
• Examples of BLAST programs
  – BLASTN
    • Nucleic acids against nucleic acids
  – BLASTP
    • Protein query against protein database
    • Usually better to use than nucleotide-nucleotide BLAST
    • Since the genetic code is degenerate, blastn can often give less specific results than blastp
    • ...but what if we don't have a protein query sequence? What are our options?
  – BLASTX
    • Translated nucleic acids against protein database
    • One way to do a protein BLAST search if you have a nucleotide query sequence
    • The BLAST program does the translating for you, in all 6 reading frames
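To make "all 6 reading frames" concrete, here is a rough sketch of six-frame translation using the standard genetic code. It only illustrates the idea and is not BLASTX's actual implementation. The function names are made up for this example, and codons containing anything other than A, C, G, T are translated as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name in sorted(frames):
    print(name, frames[name])
```

Each of the six protein strings would then be compared against the protein database, which is why a nucleotide query can still drive a protein-level search.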
BLAST Databases: Which One to Search?
What type of data do you want to search against? For example:
• Characterized sequences?
• Specialized sequences?
• Complete genomes or chromosomes?
• BLAST database descriptions are available in the:
  – BLAST help document
  – BLAST program selection guide
Request ID: RID
• An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours.
• If you've saved RIDs from your recent searches, you can enter the RIDs directly using the Retrieve results with a Request ID page, which is accessible from the bottom of the BLAST home page
Search Results: Understanding the Output
• Reference to BLAST paper
• Reminders about your specific query
  – RID
  – query sequence reminder (contains the information from your FASTA def line)
  – what database you searched against
• Graphical summary
  – shows where the hits aligned to your query
  – colors indicate score range
  – mouse over a colored bar to see info about that hit
• Text summary (GI numbers and def lines)
  – GI links to complete record in Entrez
  – Score links to pairwise alignment between your query sequence and the hit
• Pairwise alignments
• BLAST statistics for your search
Database Search w/ BLAST
• Primary use of bioinformatics
  – Finding similar sequences
  – BLAST
Acknowledgement: Slides 15 – 19 are adapted from lecture notes of Professor Chau-Wen Tseng of CS Department at the University of Maryland with permission.
Database Search w/ BLAST
• Versions of BLAST
  – BLASTN
    • Nucleic acids against nucleic acids
  – BLASTP
    • Protein query against protein database
  – BLASTX
    • Translated nucleic acids against protein database
  – TBLASTN
    • Protein query against translated nucleic acid database
  – TBLASTX
    • Translated nucleic acids against translated nucleic acids
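The choice among these programs depends only on the query type, the database type, and whether nucleotide sequences should be compared as translated proteins. A minimal lookup sketch follows. The function `choose_blast_program` and its argument names are hypothetical helpers for illustration, not part of any BLAST toolkit.

```python
def choose_blast_program(query_type, database_type, translated=False):
    """Pick a BLAST program name from the query and database types.

    query_type and database_type are "nucleotide" or "protein".
    translated only matters for nucleotide-vs-nucleotide searches, where it
    selects a comparison at the protein level (tblastx) instead of blastn.
    """
    key = (query_type, database_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if translated else "blastn"
    if key == ("protein", "protein"):
        return "blastp"
    if key == ("nucleotide", "protein"):
        return "blastx"   # query is translated in all six frames
    if key == ("protein", "nucleotide"):
        return "tblastn"  # database is translated in all six frames
    raise ValueError(f"unknown sequence types: {key}")


print(choose_blast_program("nucleotide", "protein"))
```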
Database Search w/ BLAST
• BLAST result
  – Matching sequences with bit score and E-value
  – Hyperlinks to the database entry for each sequence
• Example

Hyperlinks to sequences                    Bit Score   E-value
gi|17330420|gb|BH384278.1|BH384278 ...     153         3e-36
gi|17320126|gb|BH373984.1|BH373984 ...     140         9e-34
gi|17338337|gb|BH392196.1|BH392196 ...     112         8e-25
gi|20373967|gb|BH771010.1|BH771010 ...     105         1e-21
gi|17314411|gb|BH368367.1|BH368367 ...     104         2e-21
gi|17332712|gb|BH386570.1|BH386570 ...     64          3e-21
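Summary lines like the ones in this example can be read programmatically. The sketch below assumes each line is an identifier, an optional description, then the bit score and E-value as the last two whitespace-separated fields. That layout matches this slide but is an assumption about any real report, and the helper names are made up for illustration.

```python
def parse_hit_line(line):
    """Split one summary line into identifier, bit score, and E-value."""
    fields = line.split()
    return {
        "id": fields[0],
        "bit_score": float(fields[-2]),
        "e_value": float(fields[-1]),  # float() accepts notation such as "3e-36"
    }


def significant_hits(lines, max_e_value):
    """Keep only hits whose E-value is at or below max_e_value."""
    hits = [parse_hit_line(line) for line in lines if line.strip()]
    return [hit for hit in hits if hit["e_value"] <= max_e_value]


summary = """gi|17330420|gb|BH384278.1|BH384278 ... 153 3e-36
gi|17320126|gb|BH373984.1|BH373984 ... 140 9e-34
gi|17338337|gb|BH392196.1|BH392196 ... 112 8e-25
gi|20373967|gb|BH771010.1|BH771010 ... 105 1e-21
gi|17314411|gb|BH368367.1|BH368367 ... 104 2e-21
gi|17332712|gb|BH386570.1|BH386570 ... 64 3e-21""".splitlines()

strong = significant_hits(summary, max_e_value=1e-30)
print([hit["id"] for hit in strong])
```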
BLAST – Statistical Evaluation
• E value
  – The number of different alignments with scores equivalent to or better than the observed alignment score that are expected to occur in a database search by chance.
  – The lower the E value, the more significant the score.
BLAST – How It Works
• Find high scoring local alignments between query sequence and target database
• Assumption
  – True match alignments are very likely to contain very high-scoring matches within them
• Steps
  1. Seeding
  2. Searching
  3. Extension
  4. Evaluation
BLAST Steps
1. Seeding
• For each word of length w in the query (w-mer), generate a list of all possible words (neighbors) with a score of at least threshold T (determined by using the scoring matrix)
• Default
  – w = 3 for protein
  – w = 11 for DNA
Query word (w = 3): PQG
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words (with scores): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13, PQA 12, PQN 12, …
Neighborhood score threshold: T = 13
This example uses BLOSUM62.
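The neighborhood list for the query word PQG can be reproduced with a short brute-force sketch. Only the three BLOSUM62 rows needed for P, Q, and G are included, transcribed by hand, so check them against an authoritative copy of the matrix before reuse. Real BLAST implementations build neighborhoods more efficiently than enumerating all 20^3 candidate words.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Partial BLOSUM62: only the rows for P, Q, and G, columns in AMINO_ACIDS order.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighborhood(word, threshold):
    """Return (neighbor, score) pairs scoring at least threshold against word."""
    scored = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(
            BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)] for q, c in zip(word, candidate)
        )
        if score >= threshold:
            scored.append(("".join(candidate), score))
    # Highest score first, then alphabetical for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


neighbors = neighborhood("PQG", threshold=13)
for word, score in neighbors:
    print(word, score)
```

With T = 13 this yields nine words, from PQG (18) down to the five words scoring 13. PQA and PQN score 12 and fall just below the threshold.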
BLAST Steps
2. Searching
• Determine the locations of all common “words” between the query and the database (“word hits”)
• Identifies all word hits
Query word (w = 3): PQG
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
Hit: query word PQG matches neighborhood word PMG (score 13, T = 13) in the subject
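The search step can be sketched as an index lookup: record where every w-mer occurs in the subject, then look up each neighborhood word. The neighborhood list below is copied from the slide (words scoring at least T = 13), and the function names are hypothetical. A real BLAST search covers every query word against a whole database, not one word against one subject.

```python
from collections import defaultdict


def index_words(sequence, w):
    """Map each w-mer in sequence to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


def find_word_hits(query, subject, query_word, neighbors):
    """Return sorted (query_pos, subject_pos, subject_word) tuples for one query word."""
    w = len(query_word)
    query_positions = index_words(query, w).get(query_word, [])
    subject_index = index_words(subject, w)
    hits = []
    for word in neighbors:
        for s_pos in subject_index.get(word, []):
            for q_pos in query_positions:
                hits.append((q_pos, s_pos, word))
    return sorted(hits)


query = "SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA"
subject = "TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA"
neighbors = ["PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PHG", "PMG", "PSG"]

hits = find_word_hits(query, subject, "PQG", neighbors)
print(hits)
```

The single hit pairs PQG at query position 11 with PMG at subject position 11 (0-based), which is the seed that the extension step starts from.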
BLAST Steps
3. Extension
• Extend hits to find HSPs (high-scoring segment pairs) that have scores higher than a threshold
• Introduce gaps using dynamic programming
• Problem of extension
  – Time-consuming to find the highest score
• Solution (heuristic)
  – Extend until the score drops by a value of X from its maximum
  – Example (Match = 1, Mismatch = -1, X = 5):

    Query:    ABCDEFGHIJKLMNOPQRST
              ||||||  |||||    |
    Subject:  ABCDEFZYIJKLMXWVUTAB
    Score:    12345654567898765654
    Drop-off: 00000012100001234345

Query word (w = 3): PQG
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
         +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
The hit (PQG / PMG) is extended in both directions into the alignment shown.
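The drop-off heuristic can be sketched for the ungapped case as follows. This one-directional version uses the slide's toy scoring (match = 1, mismatch = -1, X = 5). It assumes extension stops as soon as the running score falls X or more below the best score seen, and the function name is made up for illustration. BLAST itself extends in both directions and scores with a substitution matrix.

```python
def extend_right(query, subject, match=1, mismatch=-1, x_drop=5):
    """Extend an ungapped alignment to the right with an X-drop cutoff.

    Returns (best_score, best_length): the highest running score seen and the
    number of aligned positions at which it was reached.
    """
    score = 0
    best_score = 0
    best_length = 0
    for length, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score >= x_drop:
            break  # score has dropped by X from the maximum: stop extending
    return best_score, best_length


best_score, best_length = extend_right(
    "ABCDEFGHIJKLMNOPQRST",
    "ABCDEFZYIJKLMXWVUTAB",
)
print(best_score, best_length)
```

The reported segment ends where the maximum was reached (score 9 after 13 positions), not where the extension stopped, so the trailing mismatches are trimmed.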
BLAST Steps
4. Evaluation
• Maximal segment pairs (MSPs) – maximum-scoring HSPs
• Evaluate the statistical significance of extended hits (HSPs)
• Report only those above the determined threshold (MSPs)
BLAST – Statistical Evaluation
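For local, ungapped alignments:

$$E = K\,m\,n\,e^{-\lambda S}, \qquad p = 1 - e^{-E}$$

m: size of query
n: size of database
S: alignment score
K, λ: statistical parameters that depend on the scoring system
E: expected # of HSPs with scores at least S
p: prob of finding at least one HSP with score at least S
good tutorial at: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html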
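As a numeric illustration of the two formulas for ungapped local alignments, E = K·m·n·e^(−λS) and p = 1 − e^(−E), here is a small sketch. The values of K, λ, m, n, and S below are placeholders chosen for illustration only. Real values of K and λ depend on the scoring matrix and residue frequencies and are reported in the BLAST statistics section of the output.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S): expected number of HSPs scoring at least S."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """p = 1 - exp(-E): probability of at least one HSP scoring at least S."""
    return -math.expm1(-e_value)  # expm1 keeps precision when E is tiny


# Placeholder parameters (assumptions for illustration, not taken from the slides).
K, lam = 0.134, 0.318
m, n = 250, 1_000_000_000  # query length, total database length

for S in (40, 80, 120):
    E = expected_hsps(S, m, n, K, lam)
    print(f"S={S:4d}  E={E:.3g}  p={p_value(E):.3g}")
```

For small E the two quantities are nearly equal (p ≈ E), which is why BLAST output reports E alone. Doubling the database size n doubles E for the same score.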
Interpretations of Expected Value
• Expected value ranges
  – E < 10^-100 → very low; homologs or identical genes
  – E < 10^-3 → moderate; may be related genes
  – E > 1 → high; probably unrelated
  – 0.5 < E < 1 → uncertain, in the “twilight zone”; try a more detailed search
• If database search
  – Long list of gradually declining E values → large gene family
  – Long regions of moderate similarity → more significant than short regions of high identity
• Biological relevance
  – Still need to determine biological significance!