Sequence Identification using BLAST
Vivek Krishnakumar
JCVI Genomic Science and Leadership Workshop
Presented on: 05/26/2016
Overview
• Introduction
• Why compare sequences?
• Sequence alignment steps
• Causes of sequence (dis)similarity
• Comparing two sequences
• Scoring a sequence alignment
• Introduction to BLAST
  – Available programs
  – Similarity statistics: Score and Expect value
  – BLAST output
  – Examples of various types of alignments
• FIN
Premise of Pairwise sequence alignment
• One sequence by itself is not informative; it must be analyzed by comparative methods against existing sequence databases to develop hypotheses concerning its relatives and function.
Image from Cesar Dog Food company
Pairwise sequence alignment as an Experiment
• Probably the most common “experiment” done in biology today
• Formally considered an experiment because you don’t know what you’ll get until you perform the operation
• As an experiment, it is based on a hypothesis; it uses a reproducible technique and it generates results that lead to conclusions or more experiments
Why compare sequences?
• “Match” two (pair-wise) or several (multiple) protein or nucleotide sequences to one another to assess their similarity
• Sequence similarity suggests similar function
• Similarity helps us investigate evolution
Sequence Alignment
• Sequence alignment is the assignment of residue-residue correspondences. It involves:
  – precise operators for alignment: matching, gaps
  – quantitative scoring system for matches and gaps
  – systematic search among possible alignments
Algorithm: a sequence of instructions that one must perform in order to solve a well-formulated problem
Problem: describes a class of computational tasks; a problem instance is one particular input from that class
Sequence alignment: found by the use of an alignment algorithm
Causes for sequence (dis)similarity
• Mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
• Insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)
• Deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
• Indel: an insertion or a deletion
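The three kinds of change listed above can be reproduced on plain strings. The following is a minimal Python sketch; the helper names (`substitute`, `insert`, `delete`) are hypothetical and positions are 0-based.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the nucleotide at 0-based position pos with base."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq: str, pos: int, base: str) -> str:
    """Insert base so that it ends up at 0-based position pos."""
    return seq[:pos] + base + seq[pos:]


def delete(seq: str, pos: int) -> str:
    """Remove the nucleotide at 0-based position pos."""
    return seq[:pos] + seq[pos + 1:]


print(substitute("ATA", 1, "G"))  # mutation:  ATA  -> AGA
print(insert("AA", 1, "G"))       # insertion: AA   -> AGA
print(delete("ACTG", 2))          # deletion:  ACTG -> ACG
```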
Comparing two sequences
• Point mutations, easy:
  ACGTCTGATACGCCGTATAGTCTATCT
  ACGTCTGATTCGCCCTATCGTCTATCT
• Insertions/deletions, difficult to compare:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGATTCGCATCGTCTATCT

  ACGTCTGATACGCCGTATAGTCTATCT
  ----CTGATTCGC---ATCGTCTATCT
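For the point-mutation case above, both sequences have the same length, so they can be compared position by position. A small sketch (the function name `mismatch_positions` is made up for illustration; positions are reported 1-based):

```python
seq1 = "ACGTCTGATACGCCGTATAGTCTATCT"
seq2 = "ACGTCTGATTCGCCCTATCGTCTATCT"


def mismatch_positions(a: str, b: str) -> list[int]:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(mismatch_positions(seq1, seq2))  # [10, 15, 19]
```

This position-by-position comparison stops working as soon as an insertion or deletion shifts one sequence relative to the other, which is why gaps have to be introduced.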
Scoring a sequence alignment
• Match score: +1
• Mismatch score: +0
• Gap penalty: –1

  ACGTCTGATACGCCGTATAGTCTATCT
      ||||| |||   || ||||||||
  ----CTGATTCGC---ATCGTCTATCT

Score = +11
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (–1)
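The arithmetic above can be checked with a few lines of Python. This is a sketch of the simple per-column scheme only (match +1, mismatch 0, gap –1); the function name `score_alignment` is an assumption.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score two aligned, equal-length strings column by column."""
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # one sequence has a gap in this column
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 18 matches - 7 gaps = 11
```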
Gap opening and extension penalties
• We want to find alignments that are evolutionarily likely.
• Which of the following alignments seems more likely to you?
  ACGTCTGATACGCCGTATAGTCTATCT
  ACGTCTGAT-------ATAGTCTATCT

  ACGTCTGATACGCCGTATAGTCTATCT
  AC-T-TGA--CG-CGT-TA-TCTATCT
• We can achieve this by penalizing the opening of a new gap more than the extension of an existing gap
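To make the difference between the two alignments concrete, the sketch below counts the gap runs in each gapped sequence and applies a separate opening and extension penalty. The values –2 and –1 are example penalties, and the convention that every gap position pays the extension penalty while each run pays the opening penalty once on top is an assumption of this sketch.

```python
import re


def gap_runs(aligned: str) -> list[int]:
    """Return the length of every run of consecutive '-' characters."""
    return [len(run) for run in re.findall(r"-+", aligned)]


def gap_penalty(aligned: str, gap_open: int = -2, gap_extend: int = -1) -> int:
    """Opening penalty once per run, extension penalty once per gap position."""
    runs = gap_runs(aligned)
    return gap_open * len(runs) + gap_extend * sum(runs)


one_long_gap = "ACGTCTGAT-------ATAGTCTATCT"
many_short_gaps = "AC-T-TGA--CG-CGT-TA-TCTATCT"

print(gap_runs(one_long_gap), gap_penalty(one_long_gap))        # [7] -9
print(gap_runs(many_short_gaps), gap_penalty(many_short_gaps))  # [1, 1, 2, 1, 1, 1] -19
```

Both alignments contain seven gap positions, but the single long gap is penalized far less than six scattered ones.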
Scoring a sequence alignment (2)
• Match/mismatch score: +1/+0
• Gap opening/extension penalty: –2/–1

  ACGTCTGATACGCCGTATAGTCTATCT
      ||||| |||   || ||||||||
  ----CTGATTCGC---ATCGTCTATCT

Score = +7
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gap opening: 2 × (–2)
• Extension: 7 × (–1)
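The same alignment can be scored in code under the opening/extension scheme above. This is a minimal sketch; `affine_score` is a hypothetical name, and for simplicity a gap in one sequence immediately followed by a gap in the other is treated as a single run.

```python
def affine_score(top: str, bottom: str, match: int = 1, mismatch: int = 0,
                 gap_open: int = -2, gap_extend: int = -1) -> int:
    """Score an alignment with separate gap opening and extension penalties."""
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open   # first column of a new gap run
            score += gap_extend     # every gap column pays the extension penalty
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(affine_score(top, bottom))  # 18 - 2*2 - 7*1 = 7
```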
How can we find an optimal alignment?
• Finding the alignment is computationally hard:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGAT---TCG-CATCGTC--T-ATCT
• C(27 bases, 7 gaps) = ~888,000 possibilities
• Two options:
  – Dynamic programming (optimal, but computationally time-consuming)
  – Heuristics (fastest, but trades off optimality, completeness, and accuracy/precision)
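Both points above can be sketched in Python. The first line reproduces the count of ways to place 7 gap positions among 27 columns. The function is a minimal global-alignment dynamic program (in the style of Needleman-Wunsch) that returns only the best score under the simple +1/0/–1 scheme; it is an illustration of the dynamic programming option, not the algorithm BLAST uses, and the name `global_alignment_score` is an assumption.

```python
from math import comb

print(comb(27, 7))  # 888030, i.e. ~888,000 placements of 7 gaps in 27 columns


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = 0, gap: int = -1) -> int:
    """Best global alignment score, computed row by row in O(len(a) * len(b))."""
    prev = [j * gap for j in range(len(b) + 1)]   # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)           # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag,                   # align a[i-1] with b[j-1]
                          prev[j] + gap,          # gap in b
                          curr[j - 1] + gap)      # gap in a
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGTCTGATACGCCGTATAGTCTATCT",
                             "CTGATTCGCATCGTCTATCT"))
```

The table has one cell per pair of positions, so the cost grows with the product of the two sequence lengths. That is affordable for two short sequences but expensive against a whole database, which motivates the heuristic option.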
BLAST
• Basic Local Alignment Search Tool (1990): Altschul, Gish, Miller, Myers, & Lipman
• Uses short-cuts or “heuristics” to improve search speed
• Like speed-reading, does not examine every nucleotide of the database
• However, there are many more choices (parameters) to make when tuning a search (over 30!)
• Provides statistical significance
• Available on the web, as a standalone program, and through network clients
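One way to picture the "speed-reading" shortcut is word seeding: rather than aligning the query against every position of a subject sequence, first look up short exact word matches and examine only the regions around them. The toy sketch below illustrates that idea only; it is not BLAST's actual implementation, the function names are hypothetical, and the word size k=8 is an arbitrary illustrative value.

```python
from collections import defaultdict


def build_word_index(subject: str, k: int) -> dict[str, list[int]]:
    """Map every k-letter word in the subject to its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 8) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = build_word_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


subject = "ACGTCTGATACGCCGTATAGTCTATCT"
query = "CTGATTCGCATCGTCTATCT"
print(find_seeds(query, subject, k=8))  # [(12, 19)]
```

Only the neighborhood of each seed would then be extended into a full alignment, which is where the speed-up (and the loss of guaranteed optimality) comes from.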
BLAST programs
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHomeNew
Sequence Similarity Searching – The statistics are important
• Discriminating between real and artifactual matches is done using an estimate of probability that the match might occur by chance.
• Two types of metrics, scores (S) and E-values (E), are associated with BLAST hits
Where does the score (S) come from?
• The quality of each pair-wise alignment is represented as a score (S) and the scores are ranked.
• Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein).
• The alignment score will be the sum of the scores for each position.

       A   C   G   T
   A   1  -2  -2  -2
   C  -2   1  -2  -2
   G  -2  -2   1  -2
   T  -2  -2  -2   1
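As a sketch of how such a matrix is used, the code below builds the +1/–2 nucleotide matrix shown above as a Python dictionary and sums the per-position scores of an ungapped alignment. The names `MATRIX` and `ungapped_score` are assumptions, and gap handling is deliberately left out.

```python
BASES = "ACGT"

# +1 on the diagonal (match), -2 everywhere else (mismatch), as in the matrix above.
MATRIX = {(a, b): (1 if a == b else -2) for a in BASES for b in BASES}


def ungapped_score(a: str, b: str) -> int:
    """Sum the matrix score of every aligned pair of bases."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(MATRIX[(x, y)] for x, y in zip(a, b))


print(ungapped_score("ACGTCTGAT", "ACGTCTGAT"))    # 9 matches -> 9
print(ungapped_score("ACGTCTGATA", "ACGTCTGATT"))  # 9 matches, 1 mismatch -> 7
```

For proteins the same lookup is done against a 20 × 20 amino acid substitution matrix instead of this 4 × 4 table.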
What does the E-value really mean?
• The significance of each alignment is computed as an E value (E).
  – Expectation value: the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance.
  – The lower the E value, the more significant the score.
• Statistical significance depends on both the size of the alignments and the size of the sequence database
  – Important consideration for comparing results across different searches
  – E-value increases as database gets bigger
  – E-value decreases as alignments get longer
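The dependence on database size can be made concrete with the bit-score form of the Karlin-Altschul statistics that underlie BLAST, E ≈ m · n · 2^(–S′), where S′ is the bit score, m the query length and n the total database length. The sketch below uses that formula with hypothetical lengths; the E-values BLAST actually reports also apply effective-length corrections that are ignored here, so the numbers will not match real BLAST output.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: expected number of chance hits scoring >= bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 140-residue query against databases of two sizes.
small_db = expect_value(bit_score=50, query_len=140, db_len=1_000_000)
large_db = expect_value(bit_score=50, query_len=140, db_len=10_000_000)
print(small_db, large_db)   # the same hit is 10x less significant in the larger database

# Each additional bit of score halves the E-value.
print(expect_value(51, 140, 1_000_000) / small_db)  # 0.5
```

This is why E-values from searches against different databases should not be compared directly.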
BLAST output
• List of sequences with scores
  Raw score (S)
    – Higher is better
    – Depends on aligned length
  Expect Value (E-value)
    – Smaller is better
    – Dependent on length and database size
• List of alignments
Very Similar Sequences
Query: HBA_HUMAN Hemoglobin alpha subunit
Sbjct: HBB_HUMAN Hemoglobin beta subunit
Score = 114 bits (285), Expect = 1e-26
Identities = 61/145 (42%), Positives = 86/145 (59%), Gaps = 8/145 (5%)

Query  2    LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQV  55
            L+P +K+ V A WGKV  +  E G EAL R+ + +P T+ +F  F      D   G+ +V
Sbjct  3    LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV  60

Query  56   KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA  115
            K HGKKV  A ++ +AH+D++    + LS+LH  KL VDP NF+LL + L+  LA H
Sbjct  61   KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK  120

Query  116  EFTPAVHASLDKFLASVSTVLTSKY  140
            EFTP V A+  K +A V+  L  KY
Sbjct  121  EFTPPVQAAYQKVVAGVANALAHKY  145
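The summary line of the hemoglobin alpha/beta alignment above can be recomputed from the aligned sequences themselves. The sketch below concatenates the three alignment blocks and counts identities and gaps; `alignment_stats` is a hypothetical helper, and percentages are truncated to match the figures shown. Counting "Positives" would additionally require the protein substitution matrix used for the search, so it is omitted.

```python
query = (
    "LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQV"
    "KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA"
    "EFTPAVHASLDKFLASVSTVLTSKY"
)
sbjct = (
    "LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV"
    "KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK"
    "EFTPPVQAAYQKVVAGVANALAHKY"
)


def alignment_stats(a: str, b: str) -> tuple[int, int, int]:
    """Return (identities, gaps, alignment length) for two aligned strings."""
    identities = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    gaps = sum(1 for x, y in zip(a, b) if x == "-" or y == "-")
    return identities, gaps, len(a)


identities, gaps, length = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")  # 61/145 (42%)
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")                    # 8/145 (5%)
```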
Quite Similar Sequences
Query: HBA_HUMAN Hemoglobin alpha subunit
Sbjct: MYG_HUMAN Myoglobin
Score = 51.2 bits (121), Expect = 1e-07
Identities = 38/146 (26%), Positives = 58/146 (39%), Gaps = 6/146 (4%)

Query  2    LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQV  55
            LS  +   V   WGKV A    +G E L R+F   P T   F  F      D    S  +
Sbjct  3    LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL  62

Query  56   KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA  115
            K HG  V  AL   +         +  L+  HA K ++     + +S C++  L +  P
Sbjct  63   KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG  122

Query  116  EFTPAVHASLDKFLASVSTVLTSKYR  141
            +F      +++K L      + S Y+
Sbjct  123  DFGADAQGAMNKALELFRKDMASNYK  148
Not Similar Sequences
Query: HBA_HUMAN Hemoglobin alpha subunit
Sbjct: SPAC869.02c [Schizosaccharomyces pombe]
Score = 33.1 bits (74), Expect = 0.24
Identities = 27/95 (28%), Positives = 50/95 (52%), Gaps = 10/95 (10%)

Query  30   ERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAH  89
            ++M  ++P      P+F+ +H  +      + +A AL N   ++DD+  +LSA  D
Sbjct  59   QKMLGNYPEV---LPYFNKAHQISL--SQPRILAFALLNYAKNIDDL-TSLSAFMDQIVV  112

Query  90   K---LRVDPVNFKLLSHCLLVTLAAHLPAEF-TPA  120
            K   L++   ++ ++ HCLL T+   LP++  TPA
Sbjct  113  KHVGLQIKAEHYPIVGHCLLSTMQELLPSDVATPA  147
NCBI BLAST - Web Resources
• NCBI BLAST Webpage: http://www.ncbi.nlm.nih.gov/BLAST/
• BLAST Handbook: http://www.ncbi.nlm.nih.gov/books/NBK153387/
• BLAST FAQs: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ
• Comprehensive list of BLAST related references: http://www.ncbi.nlm.nih.gov/blast/blast_references.shtml