introduction to bioinformatics - tutorial no. 2 blast
Post on 20-Dec-2015
Introduction to Bioinformatics - Tutorial no. 2
BLAST
BLAST – Outline
• Sequence Alignment: complexity and indexing, BLASTN and BLASTP
• Basic parameters: PAM and BLOSUM matrices, affine gap model, E values (once again)
• Advanced BLAST: databases, BLAST options, BLAST output, taxonomic BLAST, pairwise BLAST
BLAST Variations
Name      Query type           Database
blastn    Genomic              Genomic
blastp    Protein              Protein
blastx    Translated genomic   Protein
tblastn   Protein              Translated genomic
tblastx   Translated genomic   Translated genomic
Genomic translations test all 6 reading frames: 3 codon frames on each of the 2 strands (forward and reverse complement).
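The six reading frames used by the translated searches can be sketched as follows. This is a minimal illustration, not BLAST's own code: the function names are made up for this example, the standard genetic code is assumed, and any codon containing N is reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate 3 codon offsets on each of the 2 strands (hypothetical helper)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCC").items():
    print(name, protein)
```

A `*` in the output marks a stop codon; a frame that is interrupted by many stops is unlikely to be coding.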
BLASTN Databases
nr       GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs     High-throughput genomic sequences (draft)
pat      Patented nucleotide sequences
mito     Mitochondrial sequences
vector   Vector subset of GenBank
month    GenBank, EMBL, DDBJ and PDB sequences from the last 30 days
chrom    Contigs and chromosomes from RefSeq
BLASTP Databases
nr          GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot   SWISS-PROT
pat         Patented protein sequences
pdb         Protein Data Bank
month       GenBank CDS translations, PDB, SWISS-PROT, PIR and PRF sequences from the last 30 days
BLASTN/P Options (1)
Only search part of the database, using NCBI Entrez query format
Search a specific organism
Remove regions of low information content, e.g. short repeats or stretches rich in only 2 nucleotides
Remove known human repeats (LINEs, SINEs)
BLASTN/P Options (2)
Threshold for result significance
Use an index based on words of 7, 11 or 15 nucleotides
Costs to open and extend a gap; score for a nucleotide match or mismatch. Allowed gap scores: 10/1, 10/2, 11/1, 8/2, 9/2
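The word index mentioned above can be sketched roughly like this. It is a simplified illustration of exact-word seeding, assuming a single subject sequence; `build_word_index` and `find_seeds` are hypothetical names, and the real BLAST index and extension step are more involved.

```python
from collections import defaultdict


def build_word_index(sequence, word_size=11):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_seeds(query, index, word_size=11):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly."""
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, ()):
            seeds.append((q_pos, s_pos))
    return seeds


# Toy example with a short word size so the hits are easy to follow.
subject = "ACGTACGTTTGACCA"
index = build_word_index(subject, word_size=4)
print(find_seeds("TTGACC", index, word_size=4))
```

A smaller word size finds more seeds (more sensitive, slower); a larger one finds fewer (faster, less sensitive).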
BLASTP Options
Scoring matrix: PAM, BLOSUM, etc.
Search for a motif (PSI-BLAST)
Costs to open and extend gap
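How the open and extend costs combine under the affine gap model can be sketched as below. The convention assumed here is that a gap of length k costs open + k × extend (so 11/1 charges 12 for a single-residue gap); some texts charge the extension only from the second residue, so treat the exact formula as an assumption.

```python
def affine_gap_cost(gap_length, gap_open=11, gap_extend=1):
    """Cost of one gap of gap_length residues (assumed: open + length * extend)."""
    if gap_length <= 0:
        return 0
    return gap_open + gap_extend * gap_length


# One long gap is much cheaper than the same residues spread over many gaps.
one_gap_of_five = affine_gap_cost(5)
five_gaps_of_one = 5 * affine_gap_cost(1)
print(one_gap_of_five, five_gaps_of_one)
```

This is why the affine model favours a few long insertions or deletions over many scattered ones.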
BLASTN/P Formatting (1)
Show colored bar chart
Number of sequences listed
Number of alignments shown
Other (less important) options on what to show
BLASTN/P Formatting (2)
How to display alignments
Only show results which match Entrez search or are from specific organism
Only show results with E values in this range
BLASTN Results
Query sequence representation
Matched areas of database sequences
BLAST Output Header
Request ID for later retrieval
Query sequence details
Database details
Tax BLAST
BLAST Alignments (1)
Sequence Identifier
Sequence description
Score and E value
BLAST Alignments (2)
Normalized score of alignment
Expected number of such hits (2e-11 = 2 × 10^-11)
Number of exact matches
Number of matches with positive score
Number of insertions / deletions (gaps)
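The counts in the alignment header can be reproduced from the two aligned rows. The sketch below is only illustrative: `alignment_stats` is a hypothetical helper, and `POSITIVE_PAIRS` is a small hand-picked subset of substitutions that score positively in BLOSUM62, not the full matrix.

```python
# Illustrative subset of amino acid pairs with a positive BLOSUM62 score.
POSITIVE_PAIRS = {frozenset(pair) for pair in ("IV", "IL", "KR", "DE", "ST", "FY")}


def alignment_stats(query_row, subject_row):
    """Count identities, positives and gaps in two aligned rows ('-' = gap)."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identities = positives = gaps = 0
    for q, s in zip(query_row, subject_row):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identities += 1
            positives += 1  # an exact match also counts as a positive
        elif frozenset((q, s)) in POSITIVE_PAIRS:
            positives += 1
    return {
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
        "length": len(query_row),
    }


print(alignment_stats("MKV-LDE", "MRIALEE"))
```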
BLAST Alignments (3)
Query sequence
Exact match
Insertion / deletion
Matched sequence
Mismatch with positive score
Position within sequence
Masked low complexity region
Expectation Values
Increases linearly with length of query sequence
Increases linearly with length of database
Decreases exponentially with score of alignment
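These three trends follow from the Karlin-Altschul relation E = K·m·n·e^(-λS), which in terms of the normalized (bit) score S' becomes E = m·n·2^(-S'). The sketch below assumes this standard form; the function names are made up, and the values of λ and K depend on the scoring system, so they are left as parameters.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


base = expect_value(40, query_len=500, db_len=1_000_000)
print(base)
print(expect_value(40, query_len=1000, db_len=1_000_000) / base)  # query length doubled
print(expect_value(41, query_len=500, db_len=1_000_000) / base)   # one more bit of score
```

Doubling the query or database length doubles E, while each additional bit of score halves it.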
Tax BLAST
Lineage of organism with strongest hit
Score of organism’s strongest hit
Number of organism hits
Shared ancestry in taxonomic tree
BLAST2SEQ
Scoring scheme
Type of program
Gap model, Expect value, advanced options
Sequences
Scoring matrix
Sequences
GO!
This tool aligns two given sequences using the BLAST engine for local alignment.
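To make "local alignment" concrete, here is a small Smith-Waterman scorer. Note that this is the exact dynamic-programming method, not the BLAST heuristic that bl2seq actually uses, and it uses a simple linear gap penalty. The match, mismatch and gap values are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-5):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The two sequences share only the internal stretch ACGTACG.
print(smith_waterman_score("TTTACGTACGTTT", "GGGACGTACGGGG"))
```

Unrelated flanking regions do not lower the score, which is what distinguishes a local alignment from a global one.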
Questions
You have two query sequences, query1 and query2:
>query1
CCGTCCGTCCGTCGTCCTCCTCGCTTGCGGGGCGCCGGGCCCGTCCTCGAGCCCCCNNNNNCCGTCCGGCCGCGTCGGGGCCTCGCCGCGCTCTACCTACCTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACGGCCGGTACAGTGAAACTGCGAATGGCTCATTAAATCAGTTATGGTTCCTTTGGTCGCTCGCTCCTCTCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCCGACGGGCGCTGACCCCCTTCGCGGGGGGGATGCGTGCATTTATCAGATCAAAACCAACCCGGTCAGCCCCTCTCCGGCCCCGGCCGGGGGGCGGGCCGCGGCGGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCACGCCCCCCGTGGCGGCGACGACCCATTCGAACGTCTGCCCTATCAACTTTCGATGGTAGTCGCCGTGCCTACCATGGTGACCACGGGTGACGGGGAATCAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGACCCGGGGAGGTAGTGACGAAAAATAACAATACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTCCACTTTAAATCCTTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGCTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGAGCGGGCGGGCGGTCCGCCGCGAGGCGAGCCACCGCCCGTCCCCGCCCCTTGCCTCTCGGCGCCCCCTCGATGCTCTTAGCTGAGTGTCCCGCGGGGCCCGAAGCGTTTACTTTGAAAAAATTAGAGTGTTCAAAGCAGGCCCGAGCCGCCTGGATACCGCAGCTAGGAATAATGGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGAACTGAGGCCATGATTAAGAGGGACGGCCGGGGGCATTCGTATTGCGCCGCTAGAGGTGAAATTCTTGGACCGGCGCAAGACGGACCAGAGCGAAAGCATTTGCCAAGAATGTTTTCATTAATCAAGAACGAAAGTCGGAGGTTCGAAGACGATCAGATACCGTCGTAGTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAACCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTCGATTCCGTGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTAATTCCGATAACGAACGAGACTCTGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAACTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGTCTGTGATGCCCTTAGATGTCCGGGGCTGCACGCGCGCTACACTGACTGGCTCAGCGTGTGCCTACCCTACGCCGGCAGGCGCGGGTAACCCGTTGAACCCCATTCGTGATGGGGATCGGGGATTGCAATTATTCCCCATGAACGAGGAATTCCCAGTAAGTGCGGGTCATAAGCTTGCGTTGATTAAGTCCCTGCCCTTTGTACACACCGCCCGTCGCTACTACCGATTGGATGGTTTAGTGAGGCCCTCGGATCGGCCCCGCCGGGGTCGGCCCACGGCCTGGCGGAGCGCTGAGAAGACGGTCGAA
>query2
TACGAACGCTGGCGGCATGCTAATACATGCAAGTCGAACGAGACCTTCGGGTCTAGTGGCGCACGGGTGGCTAACGCGTGGGAATCTGCCCTTGGGTTCGGAATAACTTCGGGAAACTGAAGCTAATACCGGATGATGACGAAAGTCCAAAGATTTATCGCCCAGGGATGAGCCCGCGTAGGATTAGCTAGTTGGTGGGGTAAAGGCTCACCAAGGCAACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGAGATGATAATGACAGTATCGGGAGAATAAGCTCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCGATTTAAGTCAGAGGTGAAAGCCCGGGCTCAACCCCGAACTGCCTTTGAGACTGGATTGCTAGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGCGAAGGCGGCTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGATAACTAGCTGCCGGGGCACATGGTGTTTCGGTGGCGCACGTAACGCATTAAGTTATCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCTGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCAGCGTTTGACATCCTCATCGCGGATTTCAGAGATGATTTCCTTCAGTTCGGCTGGATGAGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTTAGTTGCCAGCATTTAGTTGGGTACTCTAAAGGAACCGCCGGTGATAAGCCGGAGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACGCGCTGGGCTACACACGTGCTACAATGGCGACTACAGTGGGCTGCAACCGTGCGAGCGGTAGCTAATCTCCAAAAGTCGTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGGCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCAGGCCTTGTACACACCGCCCGTCACACCATGGGATTTGGATTCACCCGAAGGCACTGCGTTAACCCGCAAGGGAGACAGGTGACCACGGTGGGTTTAGAGACTGGGGTGAA
Questions
Using BLASTN:
• Find out what each of these sequences codes for.
Questions
• To which organism is each sequence related?
• Do these sequences code for proteins?
Suppose the information used to answer the previous questions were not available to you. Could you suggest another way to answer them?
BLASTX
Questions
• Look carefully at the E-value column of the first 50 results for each query. What can you learn about these sequences? Are they generally conserved across organisms?
5 last answers
Questions
• Use bl2seq to align the two query sequences. What can you say about the relation between them? Based on the previous answers, does this result make sense?
Questions
You have two query sequences.
>query3
ATGTCTGCTCCACAAGCCAAGATTTTGTCTCAAGCTCCAACTGAATTGGAATTACAAGTTGCTCAAGCTTTCGTTGAATTGGAAAATTCTTCTCCAGAATTGAAAGCTGAGTTGAGACCTTTGCAATTCAAGTCCATCAGAGAAGT
>query4
GTATGTTATTAATTTGAATCTAAACTTAAGAATAATGGAGAGTAACAAAGGAAAAAAGTGTGAACGGGACGATACCAGAATGTTTCAATCTAGAAAAGTATAAAAGATAAGGACTAGGACTCAAATGTATTTGGCTGACTATCGCCTGAACCTTGATGCTAAGCAAATACCATATCTTCAAGAAAAAGCCTACTCCAGTGTTTAAGAAGAAGGGAACGATTTACTAGATCATGCTATACGCAGTAAGGTTCTGATAGTTAATTACAATCGGTCCAAGTTCTAAGCGGTGTCGTCCATGCATATATCATTTACAAGTTACTGGCGTCAACTCTTCAAATATTCAAAATATCACCTAATCAAACTTACTAACATTTTCCTTTTTTGTTTTCCTTCTTTTATAG
Now use BLASTX:
• What protein does each of these sequences code for?
• Are these proteins conserved in other organisms?
Query 3: A conserved protein component of the small (40S) ribosomal subunit of S. cerevisiae.
Query 4: No significant protein hit (E value 3.2).
Questions
• You are told that the sequences were extracted from the same gene. How could you explain the above results?
• Answer: query4 is extracted from a non-coding region (intron) and thus doesn’t code for any protein.