
Sequence Identification using BLAST

Vivek Krishnakumar

JCVI Genomic Science and Leadership Workshop
Presented on: 05/26/2016

Overview

• Introduction
• Why compare sequences?
• Sequence alignment steps
• Causes of sequence (dis)similarity
• Comparing two sequences
• Scoring a sequence alignment
• Introduction to BLAST
  – Available programs
  – Similarity statistics: Score and Expect value
  – BLAST output
  – Examples of various types of alignments
• FIN

Premise of Pairwise sequence alignment

• One sequence by itself is not informative; it must be compared against existing sequence databases to develop hypotheses about its relatives and function.


Image from Cesar Dog Food company

Pairwise sequence alignment as an Experiment

• Probably the most common “experiment” done in biology today

• Formally considered an experiment because you don’t know what you’ll get until you perform the operation

• As an experiment, it is based on a hypothesis; it uses a reproducible technique and it generates results that lead to conclusions or more experiments

Why compare sequences?

• “Match” two (pair-wise) or several (multiple) protein or nucleotide sequences to one another to assess their similarity

• Sequence similarity suggests similar function
• Similarity helps us investigate evolution

Sequence Alignment

• Sequence alignment is the assignment of residue-residue correspondences. It involves:
  – precise operators for alignment: matching, gaps
  – a quantitative scoring system for matches and gaps
  – a systematic search among possible alignments

Algorithm: a sequence of instructions that one must perform in order to solve a well-formulated problem

Problem: describes a class of computational tasks; a problem instance is one particular input from that class

Sequence alignment: found by the use of an alignment algorithm


Causes for sequence (dis)similarity

• Mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)

• Insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)

• Deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)

• Indel: an insertion or a deletion

Comparing two sequences

• Point mutations, easy:
  ACGTCTGATACGCCGTATAGTCTATCT
  ACGTCTGATTCGCCCTATCGTCTATCT

• Insertions/deletions, difficult to compare:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGATTCGCATCGTCTATCT

  ACGTCTGATACGCCGTATAGTCTATCT
  ----CTGATTCGC---ATCGTCTATCT

Scoring a sequence alignment

• Match score: +1
• Mismatch score: +0
• Gap penalty: –1

  ACGTCTGATACGCCGTATAGTCTATCT
      ||||| |||   || ||||||||
  ----CTGATTCGC---ATCGTCTATCT

Score = +11
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (–1)
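The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the two aligned rows are given as equal-length strings with "-" marking a gap; the function name `score_alignment` is made up for illustration.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score a pairwise alignment column by column (linear gap penalty)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap        # one penalty per gap position
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 18 matches, 2 mismatches, 7 gaps -> 11
```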

Gap opening and extension penalties

• We want to find alignments that are evolutionarily likely.

• Which of the following alignments seems more likely to you?

  ACGTCTGATACGCCGTATAGTCTATCT
  ACGTCTGAT-------ATAGTCTATCT

  ACGTCTGATACGCCGTATAGTCTATCT
  AC-T-TGA--CG-CGT-TA-TCTATCT

• We can achieve this by penalizing the opening of a new gap more heavily than the extension of an existing gap

Scoring a sequence alignment (2)

• Match/mismatch score: +1/+0
• Gap opening/extension penalty: –2/–1

  ACGTCTGATACGCCGTATAGTCTATCT
      ||||| |||   || ||||||||
  ----CTGATTCGC---ATCGTCTATCT

Score = +7
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gap opening: 2 × (–2)
• Extension: 7 × (–1)
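The gap opening/extension scheme can be sketched the same way. The code below assumes the convention used on this slide: every gap position costs the extension penalty, and each new run of gaps additionally costs the opening penalty. The name `affine_score` is hypothetical. It also scores the two candidate alignments from the previous slide, which a flat gap penalty of –1 cannot tell apart (both have 20 matches and 7 gap positions).

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score an alignment with separate gap opening and extension penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # which row the current gap run is in, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if row != gap_row:
                score += gap_open   # a new gap run starts here
            score += gap_extend     # every gap position pays the extension cost
            gap_row = row
        else:
            gap_row = None
            score += match if x == y else mismatch
    return score


reference = "ACGTCTGATACGCCGTATAGTCTATCT"
print(affine_score(reference, "----CTGATTCGC---ATCGTCTATCT"))  # 7
print(affine_score(reference, "ACGTCTGAT-------ATAGTCTATCT"))  # one long gap: 11
print(affine_score(reference, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # six short gaps: 1
```

Under these penalties the single long gap scores much higher than the six scattered gaps, which matches the intuition that one indel event is more likely than six.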

How can we find an optimal alignment?

• Finding the optimal alignment is computationally hard:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGAT---TCG-CATCGTC--T-ATCT
• C(27 bases, 7 gaps) ≈ 888,000 possibilities
• Two options:
  – Dynamic programming (guarantees an optimal alignment, but is computationally time-consuming)
  – Heuristics (fastest, but trade off optimality, completeness, and accuracy/precision)
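The count above and the dynamic programming option can both be illustrated briefly. The sketch below is one standard dynamic programming formulation (global alignment in the style of Needleman-Wunsch, which the slides do not name) with the earlier scoring scheme of +1/0/–1. It returns only the best score, not the alignment itself, and the function name is an assumption made for illustration.

```python
from math import comb


def global_alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score, computed row by row in O(len(a) * len(b))."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(comb(27, 7))  # ways to place 7 gaps among 27 columns: 888030
print(global_alignment_score("ACGTCTGATACGCCGTATAGTCTATCT", "CTGATTCGCATCGTCTATCT"))
```

The table has only 28 × 21 cells for these two sequences, so an exhaustive search over all gap placements is unnecessary. The cost becomes a problem when one side of the comparison is an entire database.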

BLAST

• Basic Local Alignment Search Tool (1990)
  Altschul, Gish, Miller, Myers, & Lipman

• Uses short-cuts or “heuristics” to improve search speed

• Like speed-reading, it does not examine every nucleotide of the database

• However, there are many more choices (parameters) to make when tuning a search (over 30!)

• Provides statistical significance
• Available on the web, standalone, and as network clients
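To give a feel for the kind of shortcut involved: BLAST starts from short exact word matches between query and database and only extends alignments around those hits. The sketch below shows just that seeding idea in simplified form. It is not BLAST's actual implementation, and the function names and the word size of 4 are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every length-k word in the subject to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=4):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = build_word_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


# Only regions around these seed hits would be examined further.
print(find_seeds("CTGATTCGC", "ACGTCTGATACGCCGTATAGTCTATCT"))
```

Regions of the subject that share no word with the query are never aligned, which is where the speed comes from and also why a heuristic search can miss weak similarities.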

BLAST programs

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHomeNew

Sequence Similarity Searching – The statistics are important

• Discriminating between real and artifactual matches is done using an estimate of probability that the match might occur by chance.

• Two types of metrics, scores (S) and E-values (E), are associated with BLAST hits

Where does the score (S) come from?

• The quality of each pair-wise alignment is represented as a score (S), and hits are ranked by score.

• Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein).

• The alignment score will be the sum of the scores for each position.

      A   C   G   T
  A   1  -2  -2  -2
  C  -2   1  -2  -2
  G  -2  -2   1  -2
  T  -2  -2  -2   1
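As a small illustration of summing per-position scores, the sketch below applies the +1/–2 nucleotide matrix above to the ungapped point-mutation pair from the "Comparing two sequences" slide. The names `MATRIX` and `matrix_score` are hypothetical, and gaps are deliberately not handled.

```python
BASES = "ACGT"
# +1 on the diagonal (match), -2 everywhere else (mismatch), as in the matrix above.
MATRIX = {(x, y): (1 if x == y else -2) for x in BASES for y in BASES}


def matrix_score(a, b):
    """Sum the matrix score at each position of two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(MATRIX[(x, y)] for x, y in zip(a, b))


# 24 matches and 3 mismatches: 24 * 1 + 3 * (-2) = 18
print(matrix_score("ACGTCTGATACGCCGTATAGTCTATCT", "ACGTCTGATTCGCCCTATCGTCTATCT"))
```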

What does the E-value really mean?

• The significance of each alignment is computed as an E value (E).
  – Expectation value: the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance.
  – The lower the E value, the more significant the score.

• Statistical significance depends on both the size of the alignments and the size of the sequence database
  – Important consideration for comparing results across different searches
  – E-value increases as database gets bigger
  – E-value decreases as alignments get longer
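These two trends follow from the usual Karlin-Altschul relation between a normalized bit score S' and the expect value, E = m × n × 2^(–S'), where m is the query length and n the database length. The sketch below assumes that relation in its simplest form. Real BLAST uses effective lengths with edge corrections, so reported E-values will not match it exactly, and the lengths used here are made-up numbers.

```python
def expect_value(bit_score, query_len, db_len):
    """Simplified E = m * n * 2**(-S') with raw (not effective) lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


small_db = expect_value(bit_score=50, query_len=140, db_len=1_000_000)
large_db = expect_value(bit_score=50, query_len=140, db_len=100_000_000)
better_hit = expect_value(bit_score=114, query_len=140, db_len=100_000_000)

print(small_db < large_db)    # same score, bigger database: larger E
print(better_hit < large_db)  # higher score (e.g. a longer alignment): smaller E
```

This is why an E-value from a search against a small custom database cannot be compared directly with one from a search against a large public database.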

BLAST output

• List of sequences with scores
  Raw score (S)
    – Higher is better
    – Depends on aligned length
  Expect Value (E-value)
    – Smaller is better
    – Dependent on length and database size

• List of alignments

Very Similar Sequences

Query: HBA_HUMAN Hemoglobin alpha subunit
Sbjct: HBB_HUMAN Hemoglobin beta subunit

Score = 114 bits (285), Expect = 1e-26
Identities = 61/145 (42%), Positives = 86/145 (59%), Gaps = 8/145 (5%)

Query  2    LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQV  55
            L+P +K+ V A WGKV  +  E G EAL R+ + +P T+ +F  F      D   G+ +V
Sbjct  3    LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV  60

Query  56   KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA  115
            K HGKKV  A ++ +AH+D++    + LS+LH  KL VDP NF+LL + L+  LA H
Sbjct  61   KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK  120

Query  116  EFTPAVHASLDKFLASVSTVLTSKY  140
            EFTP V A+  K +A V+  L  KY
Sbjct  121  EFTPPVQAAYQKVVAGVANALAHKY  145
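The Identities and Gaps figures in the header of the hemoglobin alpha versus beta alignment can be recomputed from the aligned rows themselves. The sketch below does so. The function name is hypothetical, and Positives are left out because counting them requires the protein substitution matrix used in the search.

```python
def alignment_stats(query_row, subject_row):
    """Return (identities, gap_columns, alignment_length) for two aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(query_row, subject_row))
    identities = sum(1 for q, s in columns if q == s and q != "-")
    gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    return identities, gaps, len(columns)


# The three blocks of the HBA_HUMAN / HBB_HUMAN alignment, joined end to end.
query = (
    "LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQV"
    "KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA"
    "EFTPAVHASLDKFLASVSTVLTSKY"
)
subject = (
    "LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV"
    "KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK"
    "EFTPPVQAAYQKVVAGVANALAHKY"
)
print(alignment_stats(query, subject))  # (61, 8, 145), as reported in the header
```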

Quite Similar Sequences

Query: HBA_HUMAN Hemoglobin alpha subunit
Sbjct: MYG_HUMAN Myoglobin

Score = 51.2 bits (121), Expect = 1e-07
Identities = 38/146 (26%), Positives = 58/146 (39%), Gaps = 6/146 (4%)

Query  2    LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQV  55
            LS  +   V   WGKV A    +G E L R+F   P T   F  F      D    S  +
Sbjct  3    LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL  62

Query  56   KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA  115
            K HG  V  AL   +         +  L+  HA K ++     + +S C++  L +  P
Sbjct  63   KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG  122

Query  116  EFTPAVHASLDKFLASVSTVLTSKYR  141
            +F      +++K L      + S Y+
Sbjct  123  DFGADAQGAMNKALELFRKDMASNYK  148

Not Similar Sequences

Query: HBA_HUMAN Hemoglobin alpha subunit
Sbjct: SPAC869.02c [Schizosaccharomyces pombe]

Score = 33.1 bits (74), Expect = 0.24
Identities = 27/95 (28%), Positives = 50/95 (52%), Gaps = 10/95 (10%)

Query  30   ERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAH  89
            ++M  ++P      P+F+ +H  +      + +A AL N   ++DD+  +LSA  D
Sbjct  59   QKMLGNYPEV---LPYFNKAHQISL--SQPRILAFALLNYAKNIDDL-TSLSAFMDQIVV  112

Query  90   K---LRVDPVNFKLLSHCLLVTLAAHLPAEF-TPA  120
            K   L++   ++ ++ HCLL T+   LP++  TPA
Sbjct  113  KHVGLQIKAEHYPIVGHCLLSTMQELLPSDVATPA  147

NCBI BLAST - Web Resources

• NCBI BLAST Webpage: http://www.ncbi.nlm.nih.gov/BLAST/

• BLAST Handbook: http://www.ncbi.nlm.nih.gov/books/NBK153387/

• BLAST FAQs: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ

• Comprehensive list of BLAST related references: http://www.ncbi.nlm.nih.gov/blast/blast_references.shtml
