Post on 22-Dec-2015


Introduction to Bioinformatics

BLAST

BLAST

• Introduction
  – What is BLAST?
  – Query Sequence Formats
  – What does BLAST tell you?
• Choices
  – Variety of BLAST
  – BLAST Programs: Which One to Use?
  – Commonly Used BLAST Programs
  – BLAST Databases: Which One to Search?
• Understanding the Output
• Database Search with BLAST
• BLAST Steps – How It Works

Acknowledgement: The presentation includes adaptations from NCBI’s Introduction to Molecular Biology Information Resources Modules

What is BLAST?

• Basic Local Alignment Search Tool

• The Google™ of bioinformatics

• Query is a DNA or protein sequence, not a text term

• Character string comparison against all the sequences in the target database

• Rigorous statistics used to identify statistically significant matches

Query Sequence Formats

• Bare sequence
  – QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP
  –   1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
     61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
    121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
    181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp

• Identifiers
  – accession, accession.version, or GIs
  – e.g., p01013, AAA68881.1, 129295, gi|129295

• FASTA format

Query Sequence in FASTA Format

• FASTA definition line ("def line") that begins with a >, followed by some text that briefly describes the query sequence on a single line

• Up to 80 nucleotide bases or amino acids per line

• Blank lines not allowed in the middle

• Example
  >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
  QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
  KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
  VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
  FLFLIKHNPTNTIVYFGRYWSP

• Additional information
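The FASTA rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of BLAST itself: the function name `parse_fasta` is hypothetical, and the sketch assumes well-formed input in which every record starts with a `>` def line.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (def_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skipped here for robustness; BLAST input should not contain blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the leading '>'
        elif header is None:
            raise ValueError("sequence data found before a '>' definition line")
        else:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
"""

records = parse_fasta(example)
def_line, sequence = records[0]
accession = def_line.split("|")[3]  # fourth field of this particular def line
print(accession, len(sequence))
```

The length printed for the example agrees with the numbered form of the same sequence shown earlier, whose last block starts at residue 181.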

What does BLAST tell you?

• Putative identity and function of your query sequence

• Helps to direct experimental design to prove the function

• Find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene

• Compare complete genomes against each other to identify similarities and differences among organisms

Variety of BLASTs: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST Programs: Which One to Use?

Depends on:

• What type of query sequence you have (nucleotide or protein)

• What type of database you will search against (nucleotide or protein)

• BLAST program descriptions
  – brief list
  – BLAST program selection guide

Commonly Used BLAST Programs

• Examples of BLAST programs
  – BLASTN
    • Nucleic acids against nucleic acids
  – BLASTP
    • Protein query against protein database
    • Usually better to use than nucleotide-nucleotide BLAST
    • Since the genetic code is degenerate, blastn can often give less specific results than blastp
    • ...but what if we don't have a protein query sequence? What are our options?
  – BLASTX
    • Translated nucleic acids against protein database
    • One way to do a protein BLAST search if you have a nucleotide query sequence
    • The BLAST program does the translating for you, in all 6 reading frames
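To make "all 6 reading frames" concrete, here is a small Python sketch of six-frame translation. It is an illustration only, not BLAST's own code: the function names are hypothetical, the standard genetic code is assumed, and any codon containing a non-ACGT character is rendered as `X`.

```python
# Standard genetic code, laid out in the usual T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six translated strings would then be compared against the protein database, which is why a single nucleotide query can produce hits in several frames.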

BLAST Databases: Which One to Search?

What type of data do you want to search against? For example:

• Characterized sequences?

• Specialized sequences?

• Complete genomes or chromosomes?

• BLAST database descriptions are available in the:
  – BLAST help document
  – BLAST program selection guide

Request ID: RID

• An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours.

• If you've saved RIDs from your recent searches, you can enter the RIDs directly using the Retrieve results with a Request ID page, which is accessible from the bottom of the BLAST home page

Search Results: Understanding the Output

• Reference to BLAST paper

• Reminders about your specific query
  – RID
  – query sequence reminder (contains the information from your FASTA def line)
  – what database you searched against

• Graphical summary
  – shows where the hits aligned to your query
  – colors indicate score range
  – mouse over a colored bar to see info about that hit

• Text summary (GI numbers and def lines)
  – GI links to complete record in Entrez
  – Score links to pairwise alignment between your query sequence and the hit

• Pairwise alignments

• BLAST statistics for your search

Database Search w/ BLAST

• Primary use of bioinformatics
  – Finding similar sequences
  – BLAST

Acknowledgement: Slides 15 – 19 are adapted from lecture notes of Professor Chau-Wen Tseng of CS Department at the University of Maryland with permission.

Database Search w/ BLAST

• Set up format options and hit the Format button

Click button!

RID

Database Search w/ BLAST

• Versions of BLAST
  – BLASTN
    • Nucleic acids against nucleic acids
  – BLASTP
    • Protein query against protein database
  – BLASTX
    • Translated nucleic acids against protein database
  – TBLASTN
    • Protein query against translated nucleic acid database
  – TBLASTX
    • Translated nucleic acids against translated nucleic acids
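The choice of program follows from the query type and the database type, which can be written as a simple lookup. The sketch below is only a memory aid: the function name and the `"nucleotide"` / `"protein"` labels are hypothetical, and `tblastn` denotes the program that compares a protein query against a translated nucleotide database.

```python
def candidate_blast_programs(query_type, database_type):
    """Return the BLAST programs that accept this query/database combination."""
    table = {
        # nucleotide vs nucleotide: direct comparison, or both sides translated
        ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
        ("protein", "protein"): ["blastp"],
        # nucleotide query is translated in six frames before the search
        ("nucleotide", "protein"): ["blastx"],
        # nucleotide database is translated in six frames before the search
        ("protein", "nucleotide"): ["tblastn"],
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unknown combination: {query_type!r} query, {database_type!r} database"
        ) from None


print(candidate_blast_programs("nucleotide", "protein"))
```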

Database Search w/ BLAST

• BLAST graphic result

Database Search w/ BLAST

• BLAST result
  – Matching sequences w/ bit score & E-value
  – Hyperlinks to database entry for sequence

• Example

  Hyperlinks to sequences                    Bit score   E-value
  gi|17330420|gb|BH384278.1|BH384278 ...     153         3e-36
  gi|17320126|gb|BH373984.1|BH373984 ...     140         9e-34
  gi|17338337|gb|BH392196.1|BH392196 ...     112         8e-25
  gi|20373967|gb|BH771010.1|BH771010 ...     105         1e-21
  gi|17314411|gb|BH368367.1|BH368367 ...     104         2e-21
  gi|17332712|gb|BH386570.1|BH386570 ...     64          3e-21
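Summary lines of this shape are easy to pull apart programmatically. The following Python sketch assumes exactly the layout of the example (a `gi|...|gb|...|...` identifier first, bit score and E-value as the last two whitespace-separated fields); the function name is hypothetical and other BLAST report formats would need different handling.

```python
def parse_hit_line(line):
    """Split one summary line into GI number, accession, bit score and E-value."""
    fields = line.split()
    identifier, bit_score, e_value = fields[0], fields[-2], fields[-1]
    parts = identifier.split("|")  # e.g. ['gi', '17330420', 'gb', 'BH384278.1', 'BH384278']
    return {
        "gi": parts[1],
        "accession": parts[3],
        "bit_score": float(bit_score),
        "e_value": float(e_value),  # float() understands notation such as '3e-36'
    }


lines = [
    "gi|17330420|gb|BH384278.1|BH384278 ... 153 3e-36",
    "gi|17320126|gb|BH373984.1|BH373984 ... 140 9e-34",
    "gi|17338337|gb|BH392196.1|BH392196 ... 112 8e-25",
]
hits = [parse_hit_line(line) for line in lines]
best = min(hits, key=lambda hit: hit["e_value"])  # lowest E-value is most significant
print(best["accession"], best["bit_score"], best["e_value"])
```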

BLAST – Statistical Evaluation

• E value
  – The number of different alignments with scores equivalent to or better than the alignment score that are expected to occur in a database search by chance.
  – The lower the E value, the more significant the score.

BLAST – How It Works

• Find high scoring local alignments between query sequence and target database

• Assumption
  – True match alignments are very likely to contain within them very high-scoring matches

• Steps
  1. Seeding
  2. Searching
  3. Extension
  4. Evaluation

BLAST Steps

1. Seeding

• For each word of length w in the query (w-mer), generate a list of all possible words (neighbors) with a score of at least threshold T (determined by using the scoring matrix)

• Default
  – w = 3 for protein
  – w = 11 for DNA

Example (this example uses BLOSUM 62):

Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI

Query word (w = 3): PQG

Neighborhood words:
  PQG 18
  PEG 15
  PRG 14
  PKG 14
  PNG 13
  PDG 13
  PHG 13
  PMG 13
  PSG 13
  ---- neighborhood score threshold (T = 13) ----
  PQA 12
  PQN 12
  …

Figure: BLOSUM 62 scoring matrix
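The seeding step for the query word PQG can be reproduced with a short Python sketch. To keep it small, only the three BLOSUM62 rows needed for P, Q and G are typed in (an excerpt, so other query words would need the full matrix), and the function name `neighborhood` is hypothetical. Real BLAST builds this list far more efficiently than the brute-force enumeration of all 8,000 three-letter words used here.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Excerpt of BLOSUM62: scores of P, Q and G against each residue in AMINO_ACIDS order.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighborhood(word, threshold):
    """Return (neighbor, score) pairs scoring at least threshold, best first."""
    scored = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        # Sum the position-by-position substitution scores against the query word.
        score = sum(
            BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)] for q, c in zip(word, candidate)
        )
        if score >= threshold:
            scored.append(("".join(candidate), score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


neighbors = neighborhood("PQG", threshold=13)
for word, score in neighbors:
    print(word, score)
```

With T = 13 the sketch yields the nine words listed above the threshold line in the example; lowering T lets in words such as PQA and PQN.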

BLAST Steps

2. Searching

• Determine the locations of all common “words” between the query and the database (“word hits”)

• Identifies all word hits

Example:

Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI

Query word (w = 3): PQG

Neighborhood words (T = 13): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13 (below threshold: PQA 12, PQN 12, …)

Hit: the neighborhood word PMG occurs in the subject, aligned with the query word PQG

Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
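The searching step amounts to a dictionary lookup: every neighborhood word is indexed by the query position it came from, and the subject is scanned once with a sliding window. The Python sketch below uses the query, subject and neighborhood words of the example; the function name and the `neighborhoods` data structure are assumptions made for illustration, and only the single query word PQG is indexed rather than every word of the query.

```python
def find_word_hits(subject, neighborhoods, w=3):
    """Return (query_pos, subject_pos, word) for each neighborhood word found in subject.

    neighborhoods maps a 0-based query position to the set of neighborhood
    words generated for the query word starting at that position.
    """
    lookup = {}
    for q_pos, words in neighborhoods.items():
        for word in words:
            lookup.setdefault(word, []).append(q_pos)

    hits = []
    for s_pos in range(len(subject) - w + 1):
        window = subject[s_pos:s_pos + w]
        for q_pos in lookup.get(window, []):
            hits.append((q_pos, s_pos, window))
    return hits


query = "GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI"
subject = "TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA"
pqg_neighbors = {"PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PHG", "PMG", "PSG"}

hits = find_word_hits(subject, {query.index("PQG"): pqg_neighbors})
print(hits)
```

Each hit pairs a position in the query with a position in the subject, which is the starting point for the extension step.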

BLAST Steps

3. Extension

• Extend hits to find HSPs (high-scoring segment pairs) that have scores higher than a threshold

• Introduce gaps using dynamic programming

• Problem of extension
  – Time-consuming to find the highest score

• Solution (heuristic)
  – Extend until the score drops by a value of X below the best score seen so far

Example (Match = 1, Mismatch = -1, X = 5):

  ABCDEFGHIJKLMNOPQRST
  ||||||  |||||    |
  ABCDEFZYIJKLMXWVUTAB
  1234565456789876565   Score
  00000012100001234345  Drop-off score

Example with the protein query:

Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI

Query word (w = 3): PQG

Neighborhood words (T = 13): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13 (below threshold: PQA 12, PQN 12, …)

Hit (PQG / PMG), extended in both directions:

Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
         +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
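The drop-off heuristic can be sketched in Python as follows. This is a simplified, ungapped, one-directional version under the scoring of the letter example (match = 1, mismatch = -1, X = 5); the function name is hypothetical, and it stops as soon as the running score has fallen X below the best score seen, which is one reading of the rule on the slide.

```python
def extend_right(query, subject, match=1, mismatch=-1, x_drop=5):
    """Extend an ungapped alignment to the right from position 0.

    Returns (best_score, best_length): the highest running score and the
    number of aligned positions at which it was reached.
    """
    score = best_score = best_length = 0
    for length, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score >= x_drop:
            break  # drop-off reached X: give up and keep the best prefix
    return best_score, best_length


best_score, best_length = extend_right(
    "ABCDEFGHIJKLMNOPQRST", "ABCDEFZYIJKLMXWVUTAB"
)
print(best_score, best_length)
```

The extension is trimmed back to the position of the best score, so the reported segment ends at the last residue that contributed to the maximum rather than where the scan stopped.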

BLAST Steps

4. Evaluation

• Maximal segment pairs (MSPs) – maximum-scoring HSPs

• Evaluate the statistical significance of extended hits (HSPs)

• Report only those above the determined threshold (MSPs)

BLAST – Statistical Evaluation

For local, ungapped alignments:

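m: size of query

n: size of database

E: expected number of HSPs with scores of at least S

p: probability of finding at least one HSP with a score of at least S

$$E = K m n e^{-\lambda S}$$

$$p = 1 - e^{-E}$$

Here K and λ are constants that depend on the scoring system.

Good tutorial at: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html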
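The two quantities defined here follow the relations E = K·m·n·e^(−λS) and p = 1 − e^(−E), where K and λ are constants that depend on the scoring system. A short Python sketch makes the behavior tangible; the function names are hypothetical, and the values of K, λ, m and n in the usage lines are illustrative assumptions, not parameters reported by any particular BLAST run.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S): expected number of HSPs scoring at least S."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """p = 1 - exp(-E): probability of at least one HSP scoring at least S."""
    # expm1 keeps precision when E is tiny, where 1 - exp(-E) would round to 0.
    return -math.expm1(-e_value)


# Illustrative values only (assumed for this sketch).
K, lam = 0.13, 0.32
m, n = 250, 5_000_000

for S in (30, 60, 90):
    E = expected_hsps(S, m, n, K, lam)
    print(S, E, p_value(E))
```

Because S sits in the exponent, each additional point of score shrinks E by a constant factor, while doubling the database size n only doubles E. For small E, p and E are nearly identical.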

Interpretations of Expected Value

• Expected value ranges
  – E < 10^-100 → very low, homologs or identical genes
  – E < 10^-3 → moderate, may be related genes
  – E > 1 → high, probably / may be unrelated
  – 0.5 < E < 1 → ??? In the “twilight zone”; try a detailed search

• If database search
  – Long list of gradually declining E values → large gene family
  – Long regions of moderate similarity → more significant than short regions of high identity

• Biological relevance
  – Still need to determine biological significance!
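The rules of thumb for E-value ranges can be written as a small Python helper. This is only a restatement of the ranges listed on this slide, reading the twilight-zone range as 0.5 < E < 1; the function name is hypothetical, the cut-offs are guidelines rather than fixed standards, and values the slide does not cover are reported as such.

```python
def interpret_e_value(e):
    """Map an E-value to the rough interpretation given on the slide."""
    if e < 1e-100:
        return "very low: homologs or identical genes"
    if e < 1e-3:
        return "moderate: may be related genes"
    if e > 1:
        return "high: probably unrelated"
    if 0.5 < e < 1:
        return "twilight zone: try a detailed search"
    return "not covered by the ranges on the slide"


for e in (1e-120, 3e-36, 0.7, 5.0, 0.01):
    print(e, "->", interpret_e_value(e))
```

Whatever label comes out, the last bullet still applies: statistical significance does not by itself establish biological significance.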