Summer Bioinformatics Workshop 2008

BLAST

Chi-Cheng Lin, Ph.D., Professor
Department of Computer Science
Winona State University – Rochester Center
[email protected]

BLAST

• Introduction
  – What is BLAST?
  – Query Sequence in FASTA Format
  – What does BLAST tell you?
• Choices
  – BLAST Programs: Which One to Use?
  – Commonly Used BLAST Programs
  – BLAST Databases: Which One to Search?
• Understanding the Output
• Database Search with BLAST
• BLAST Steps – How It Works

Acknowledgement: The presentation includes adaptations from NCBI's Introduction to Molecular Biology Information Resources Modules.

What is BLAST?

• Basic Local Alignment Search Tool
• The Google™ of bioinformatics
• query is a DNA or protein sequence, not a text term
• character string comparison against all the sequences in the target database
• rigorous statistics used to identify statistically significant matches

Query Sequence in FASTA Format

• FASTA definition line ("def line") that begins with a >, followed by some text that briefly describes the query sequence on a single line

• up to 80 nucleotide bases or amino acids per line

• example and additional information

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
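The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `parse_fasta` is hypothetical, and the 60-character line width in the example is an assumption (the slide only requires at most 80 characters per line).

```python
def parse_fasta(text):
    """Yield (def_line, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new def line closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated; the line breaks carry no meaning.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# The example record from the slide (line width of 60 is an assumption).
EXAMPLE = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
"""

records = list(parse_fasta(EXAMPLE))
for def_line, sequence in records:
    print(def_line, len(sequence))
```

The def line is kept without its leading ">" so it can be used directly as a label for the query.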

What does BLAST tell you?

• putative identity and function of your query sequence

• helps to direct experimental design to prove the function

• find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene

• compare complete genomes against each other to identify similarities and differences among organisms

BLAST Programs: Which One to Use?

Depends on:
• what type of query sequence you have (nucleotide or protein)
• what type of database you will search against (nucleotide or protein)
• Most commonly used BLAST programs
  – blastn
  – blastp
  – blastx
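The choice above is a lookup on two inputs: query type and database type. A minimal sketch, covering only the three programs named on the slide (the function name `choose_program` and the type labels are hypothetical):

```python
# (query type, database type) -> program, limited to the three programs
# listed on the slide. Other combinations are outside this sketch.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
}


def choose_program(query_type, database_type):
    """Return the program name, or None if the slide does not cover the pair."""
    return PROGRAMS.get((query_type, database_type))


print(choose_program("nucleotide", "protein"))
```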

Commonly Used BLAST Programs

• BLASTN
  – Nucleic acids against nucleic acids
• BLASTP
  – Protein query against protein database
  – usually better to use than nucleotide-nucleotide BLAST
  – ...but... if we don't have a protein query sequence, what are our options?
• BLASTX
  – Translated nucleic acids against protein database
  – one way to do a protein BLAST search if you have a nucleotide query sequence
  – the BLAST program does the translating for you, in all 6 reading frames
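To make "all 6 reading frames" concrete: three frames come from the query as given (starting at base 1, 2, or 3) and three from its reverse complement. The sketch below illustrates that idea with the standard genetic code; it is not BLASTX's actual implementation, and the function names and frame labels (+1 to -3) are assumptions chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six protein strings can then be compared against a protein database, which is why a nucleotide query still allows a protein-level search.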

Request ID: RID

• An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours.

• If you've saved RIDs from your recent searches, you can enter the RIDs directly using the Retrieve results with a Request ID page, which is accessible from the bottom of the BLAST home page
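A saved RID can also be turned into a results link programmatically. The sketch below only builds a URL and makes no network request. The endpoint and the `CMD`, `RID`, and `FORMAT_TYPE` parameters follow NCBI's BLAST URL API as best recalled here and should be checked against the current NCBI documentation; the RID value and the function name `results_url` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API; verify before relying on it.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def results_url(rid, format_type="Text"):
    """Build a link that asks the server for the results stored under an RID."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return BLAST_URL + "?" + urlencode(params)


# "EXAMPLE0RID" is a placeholder, not a real request ID.
url = results_url("EXAMPLE0RID")
print(url)
```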

Search Results: Understanding the Output

• Reference to BLAST paper
• Reminders about your specific query
  – RID
  – query sequence reminder (contains the information from your FASTA def line)
  – what database you searched against
• Graphical summary
  – shows where the hits aligned to your query
  – colors indicate score range
  – mouse over a colored bar to see info about that hit
• Text summary (GI numbers and Def lines)
  – GI links to complete record in Entrez
  – Score links to pairwise alignment between your query sequence and the hit
• Pairwise alignments
• BLAST statistics for your search

Database Search w/ BLAST

Used most often!

Database Search w/ BLAST

• Selecting a BLAST program

• Insert sequence

• Hit “BLAST” near the end of the web page

In general, if you select blastn, select “Others” as your Database to search.

Database Search w/ BLAST

• RID and search status will appear

Database Search w/ BLAST

• Wait for your result (patiently …)

Database Search w/ BLAST

• Interpret the result
  – Graphic result
  – Black lines mark the hits with the lowest alignment scores, while red lines mark the hits with the highest. In the example below, the purple hits are the best matches available.

Source of the image: http://www.bio.davidson.edu/courses/genomics/2006/martens/favorite_gene.html
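The color coding in the graphic is a binning of alignment scores. The sketch below assumes the score ranges of the classic NCBI color key (below 40, 40 to 50, 50 to 80, 80 to 200, 200 and above); these thresholds are not stated in the slides, and the color names follow the slide's wording ("purple" for the second-highest band). The function name `hit_color` is hypothetical.

```python
def hit_color(score):
    """Map an alignment score to its bar color (thresholds are an assumption)."""
    if score < 40:
        return "black"
    if score < 50:
        return "blue"
    if score < 80:
        return "green"
    if score < 200:
        return "purple"
    return "red"


for score in (25, 45, 60, 150, 400):
    print(score, hit_color(score))
```

Under this key, a result page whose best bars are purple, as in the example image, has no hit scoring 200 or above.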

Database Search w/ BLAST

• BLAST result
  – Matching sequences w/ bit-score & E-value
  – Hyperlinks to database entry for sequence
• Example

Note that 3e-188 means 3 × 10^-188.
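The "e" notation in the E-value column is ordinary scientific notation, so it can be parsed directly as a number. A small sketch with made-up hits (the names, values, and the 1e-3 cutoff are all hypothetical, chosen only to show the comparison):

```python
import math

# Hypothetical (hit name, E-value as printed) pairs.
hits = [("hit_A", "3e-188"), ("hit_B", "2e-05"), ("hit_C", "0.73")]

# "3e-188" parses to the same number as 3 x 10^-188.
print(math.isclose(float("3e-188"), 3 * 10 ** -188))

# Keep only hits at or below an illustrative cutoff.
CUTOFF = 1e-3
significant = [name for name, e_value in hits if float(e_value) <= CUTOFF]
print(significant)
```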

BLAST – Statistical Evaluation

• E Value
  – The number of different alignments with scores equivalent to or better than the alignment score that are expected to occur in a database search by chance.
  – The lower the E value, the more significant the score.
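One common way to write this relationship uses the bit score S' and the sizes of the search space: E = m × n × 2^(-S'), where m is the query length and n is the total database length. The slides do not give this formula; the sketch below is a simplified illustration that ignores the effective-length corrections BLAST applies, and the function name and the example lengths are hypothetical.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E value: search-space size scaled by 2^(-bit score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 250-residue query against 1e9 database residues.
for bit_score in (30, 40, 50):
    print(bit_score, expected_chance_hits(bit_score, 250, 1e9))
```

Two consequences follow from the formula: each additional bit of score halves the E value, and the same score is less significant in a larger database.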

BLAST Steps – How It Works

1. Seeding
   - Prepare a list of short, fixed-length segments (words) from the query
2. Searching
   - Find highly similar or exact match for each word
3. Extension
   - Extend each match to (potentially) a longer match
4. Evaluation
   - Evaluate the results using E values
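The four steps above can be sketched as a toy program. This is a deliberately simplified illustration, not how BLAST is implemented: it uses exact word matches only, extends without gaps or mismatches, and ranks by match length as a stand-in for E values. All function names and the word size of 3 are assumptions.

```python
def seed_words(query, w=3):
    """Step 1 (seeding): map each length-w word to its start positions in the query."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words


def find_seeds(words, subject, w=3):
    """Step 2 (searching): exact word matches as (query_pos, subject_pos) pairs."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend(query, subject, i, j, w=3):
    """Step 3 (extension): grow a seed left and right while characters agree."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + w, j + w
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q


def toy_blast(query, subject, w=3):
    """Step 4 (evaluation, simplified): rank distinct segments by length."""
    words = seed_words(query, w)
    segments = {extend(query, subject, i, j, w) for i, j in find_seeds(words, subject, w)}
    return sorted(segments, key=lambda seg: seg[2], reverse=True)


# Each result is (query_start, subject_start, match_length).
print(toy_blast("ACGTTGCA", "TTACGTTGG"))
```

Several seeds usually extend to the same segment, which is why the results are collected in a set before ranking.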