
Upload: derrick-norris

Post on 02-Jan-2016


Local alignment, BLAST and Psi-BLAST

October 25, 2012

Local alignment

Quiz 2

Learning objectives - Learn the basics of BLAST and Psi-BLAST

Workshop - Use BLAST2 to determine local sequence similarities.

Homework #6 due Nov 1: Chapter 5, Problem 8; Chapter 6, Problems 1 and 4.

Local Alignment

1. Initialize the i-1 row and j-1 column with zeros.

2.

3. For traceback, start at the highest value in the matrix and trace back until a zero is reached.
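A minimal sketch of these steps, assuming the standard Smith-Waterman fill rule (each cell is the maximum of zero, the diagonal score plus match/mismatch, and the upper or left score plus a gap penalty). The scoring values (match=2, mismatch=-1, gap=-2) and the toy sequences are illustrative assumptions, not values from the lecture.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step 1: the extra first row and first column are initialized with zeros.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill step (assumed standard rule): scores never go below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Step 3: start at the highest value and trace back until a zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (hypothetical) sharing the segment RCPHH.
print(smith_waterman("TTTRCPHHGGG", "AAARCPHHAAA"))
```

Only the shared segment is reported because the zero floor lets the alignment restart wherever the flanking residues stop matching.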

Local Alignment (continued)

Which software program should one use for local alignment?

Most researchers use methods for determining local similarities:

Smith-Waterman (gold standard)

FASTA

BLAST

FASTA and BLAST do not find every possible alignment of query with database sequence. These are used because they run faster than S-W.

BLAST

Three phases:

1) List of high scoring words

2) Scan the sequence database

3) Extend hits

The threshold and word size

The program declares a hit if the word taken from the query sequence has a score >= T when a scoring matrix is used.

This allows the word size (W) to be kept high (for speed) without sacrificing sensitivity.

If T is increased, the number of background hits is reduced and the program will run faster.

Phase 1: Compile a list of high-scoring words at or above threshold T.

Query sequence is human p53: . . . RCPHHERCSD. . .

Words derived from query sequence: RCP, CPH, PHH, HHE, …

Threshold T (T = 17):

Word    Scores from BLOSUM scoring matrix    Total score
RCP     5 + 9 + 7                            21
KCP     2 + 9 + 7                            18
QCP     1 + 9 + 7                            17
----------------------------------------------------------
ECP     0 + 9 + 7                            16
. . .   . . .

Note: The line is located at the threshold cutoff. Word size is 3.
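The word list can be sketched in a few lines of Python. This is only an illustration: `PAIR_SCORES` holds just the six substitution scores that appear in the table (they agree with BLOSUM62, which is assumed here since the slide only says "BLOSUM"), and the candidate words are limited to the four shown rather than all possible 3-letter words.

```python
# Substitution scores copied from the table (assumed to be BLOSUM62 entries).
PAIR_SCORES = {
    ("R", "R"): 5, ("K", "R"): 2, ("Q", "R"): 1, ("E", "R"): 0,
    ("C", "C"): 9, ("P", "P"): 7,
}


def pair_score(a, b):
    """Look up a symmetric substitution score."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def word_score(candidate, query_word):
    """Sum the position-by-position scores of a candidate word against the query word."""
    return sum(pair_score(c, q) for c, q in zip(candidate, query_word))


def high_scoring_words(query_word, candidates, threshold):
    """Keep the candidate words whose score is at or above threshold T."""
    scored = [(w, word_score(w, query_word)) for w in candidates]
    return [(w, s) for w, s in scored if s >= threshold]


print(high_scoring_words("RCP", ["RCP", "KCP", "QCP", "ECP"], threshold=17))
# [('RCP', 21), ('KCP', 18), ('QCP', 17)]
```

ECP scores 16 and is dropped, which is the cutoff line in the table.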

Phase 2: Scan the database for short segments that match the list of acceptable words, i.e. words with scores at or above threshold T. These are potential hits.
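A rough sketch of the scan, assuming the database is a small dictionary of sequences held in memory. The entry `toy_seq` is hypothetical; the hamster segment is the one used in the extension example. Real BLAST also records where in the query each word came from, which is left out here.

```python
def scan_database(accepted_words, database, word_size=3):
    """Return (sequence name, position, word) for every accepted word found."""
    accepted = set(accepted_words)
    hits = []
    for name, seq in database.items():
        # Slide a window of word_size along each database sequence.
        for pos in range(len(seq) - word_size + 1):
            word = seq[pos:pos + word_size]
            if word in accepted:
                hits.append((name, pos, word))
    return hits


database = {
    "hamster_p53": "EVVRRCPHHERSSE",   # segment from the lecture example
    "toy_seq": "MKKCPLLQCPA",          # hypothetical sequence for illustration
}
print(scan_database(["RCP", "KCP", "QCP"], database))
# [('hamster_p53', 4, 'RCP'), ('toy_seq', 2, 'KCP'), ('toy_seq', 7, 'QCP')]
```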

Phase 3: Extend the potential hits to the left and to the right and terminate when the tabulated score drops below a cutoff score.

Query  EVVRRCPHHERCSD
       EVVRRCPHHER S+
Sbjct  EVVRRCPHHERSSE (Ch. hamster p53 O09185)

If the sequence alignment is extended far enough, and the score is higher than the alignment score cutoff, the query/sbjct segment pair is called a hit.
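The extension step can be sketched as an ungapped extension that stops once the running score falls too far below the best score seen so far. This sketch makes two simplifying assumptions that are not from the lecture: identical residues score +5 and all other pairs score -4 (instead of a BLOSUM matrix), and the drop-off cutoff `x_drop` is a hypothetical parameter.

```python
def extend_hit(query, subject, q_start, s_start, word_size, x_drop=10,
               match=5, mismatch=-4):
    """Extend a word hit left and right without gaps.

    Returns (best score, query start, query end) with end exclusive.
    """
    def score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    running = sum(score(query[q_start + k], subject[s_start + k])
                  for k in range(word_size))
    best = running

    # Extend to the right.
    best_right, k = 0, 0
    qi, si = q_start + word_size, s_start + word_size
    while qi + k < len(query) and si + k < len(subject):
        running += score(query[qi + k], subject[si + k])
        k += 1
        if running > best:
            best, best_right = running, k
        elif best - running > x_drop:
            break  # score dropped too far below the best: stop extending

    # Extend to the left, starting again from the best score so far.
    running, best_left, k = best, 0, 0
    while q_start - 1 - k >= 0 and s_start - 1 - k >= 0:
        running += score(query[q_start - 1 - k], subject[s_start - 1 - k])
        k += 1
        if running > best:
            best, best_left = running, k
        elif best - running > x_drop:
            break

    return best, q_start - best_left, q_start + word_size + best_right


query, subject = "EVVRRCPHHERCSD", "EVVRRCPHHERSSE"
best, start, end = extend_hit(query, subject, q_start=4, s_start=4, word_size=3)
print(best, query[start:end])  # 56 EVVRRCPHHERCS
```

With these toy scores the extension runs through the C/S mismatch because the following S/S match recovers the score, and it leaves out the final D/E pair.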

The relationship between extension length and cumulative score

The steps to a Gapped BLAST search.

What are the different BLAST programs?

blastp compares an amino acid query sequence against a protein sequence database

blastn compares a nucleotide query sequence against a nucleotide sequence database

blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database

tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames

tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
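One way to keep the five programs straight is a lookup keyed on the query type and the database type. The function `choose_program` and its `translate_both` flag are hypothetical helpers written for this illustration; they are not part of any BLAST distribution.

```python
# (query type, database type) -> program, following the descriptions above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program; translate_both selects tblastx for two nucleotide inputs."""
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_program("nucleotide", "protein"))                        # blastx
print(choose_program("nucleotide", "nucleotide", translate_both=True))  # tblastx
```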

What are the different BLAST programs? (continued)

psi-blast compares a protein sequence to a protein database. Performs the comparison in an iterative fashion in order to detect homologs that are evolutionarily distant.

blast2 compares two protein or two nucleotide sequences.

The E value (false positive expectation value)

The Expect value (E) is a parameter that describes the number of “hits” one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially as the Similarity Score (S) increases (inverse relationship). The higher the Similarity Score, the lower the E value. Essentially, the E value describes the random background noise that exists for matches between two sequences. The E value is used as a convenient way to create a “significance” threshold for reporting results. When the E value is increased from the default value prior to a sequence search, a larger list with more low-similarity scoring hits can be reported. An E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size you might expect to see 1 match with a similar score simply by chance.

E value (Karlin-Altschul statistics)

E = K•m•n•e^(-λS)

Where K is a scaling factor (constant), m is the length of the query sequence, n is the length of the database sequence, λ is the decay constant, S is the similarity score.

If S increases, E decreases exponentially.

If the decay constant increases, E decreases exponentially.

If m•n increases, the “search space” increases. Then there is a greater chance for a random “hit” and E increases. A larger database will increase E. However, a larger query sequence often results in a lower E value. Why???
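These relationships can be checked numerically with a short sketch of the formula. The default values K = 0.134 and λ = 0.318 are illustrative assumptions (values commonly quoted for ungapped BLOSUM62 scoring), and the query length, database lengths and scores are made up for the example.

```python
import math


def expect_value(score, m, n, K=0.134, lam=0.318):
    """E = K * m * n * exp(-lambda * S), with assumed K and lambda."""
    return K * m * n * math.exp(-lam * score)


m = 300                      # hypothetical query length
small_db = 1_000_000         # hypothetical database length
large_db = 2_000_000         # twice the search space

# A higher score gives an exponentially lower E value.
print(expect_value(40, m, small_db) > expect_value(60, m, small_db))   # True

# Doubling n doubles E for the same score.
ratio = expect_value(50, m, large_db) / expect_value(50, m, small_db)
print(round(ratio, 6))                                                  # 2.0
```

Because E is linear in m•n but exponential in S, a modest gain in score outweighs a large increase in search space.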

Thought problem

A homolog to a query sequence resides in two databases. One is the UniProt database and the other is the PDB database. After performing BLAST search against the UniProt database you obtain an E value of 1. After performing the BLAST search against the PDB database you obtain an E value of 0.0625. What is the ratio of the sizes of the two databases?

Using BLAST to get quick answers to bioinformatics problems

Task                          BLAST method                                  Trad. method
Predict protein function (1)  Perform blastp on PIR or Swiss-Prot database  Perform wet-lab experiment
Predict protein function (2)  Perform tblastn on NR database                Perform wet-lab experiment
Predict protein structure     Perform blastp against PDB                    Structure prediction software, x-ray crystal., NMR

Using BLAST to get quick answers to bioinformatics problems (cont.)

Task                             BLAST method                                                                      Trad. method
Locate genes in a genome         Divide genome into 2-5 kb sequences. Perform blastx against NR protein database  Run gene prediction software. Perform microarray analysis or RNAs
Find distantly related proteins  Perform psi-blast                                                                 No traditional method
Identify DNA sequence            Perform blastn                                                                    Screen genomic DNA library

Filtering Repetitive Sequences

Over 50% of genomic DNA is repetitive

This is due to:

retrotransposons

ALU region

microsatellites

centromeric sequences, telomeric sequences

5’ Untranslated Region of ESTs

Example of EST with simple low complexity region:

T27311GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC

Filtering Repetitive Sequences and Masking

Options available for user.

PSI-BLAST

PSI - position specific iterative

A position specific scoring matrix (PSSM) is constructed automatically from multiple HSPs of initial BLAST search. Normal E value threshold is used.

The PSSM is created as the new scoring matrix for a second BLAST search. A low E value threshold is used (E=.001).

Result:

1) obtains distantly related sequences

2) finds the important residues that provide function or structure.
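To give a feel for what a PSSM holds, here is a simplified sketch that builds log-odds scores column by column from a set of aligned segments. It is not the PSI-BLAST procedure itself: it assumes gap-free segments of equal length, a uniform background frequency of 1/20, and a simple pseudocount, and it skips the sequence weighting that PSI-BLAST applies. The two input segments are the human and hamster p53 fragments from the extension example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs)
        # Log-odds of the smoothed column frequency against the background.
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


pssm = build_pssm(["RCPHHERCSD", "RCPHHERSSE"])
# Conserved column 0 favors R; residues never seen there score below zero.
print(pssm[0]["R"] > 0, pssm[0]["K"] < 0)   # True True
# Column 7 is variable (C or S), so both residues get the same smaller score.
print(pssm[7]["C"] == pssm[7]["S"])         # True
```

Columns that stay conserved across the HSPs end up with high scores for one residue, which is how the matrix points to residues important for function or structure.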

Workshop

Is the American crocodile (Crocodylus acutus) more closely related to the sea turtle (Cheloniidae) or to the turkey (Meleagris gallopavo)? Choose two genes from each species and compare using blast2. Record bit score, E-value, percent nucleotide identities, percent similarities and lengths of coverage query/sbjct sequences in your answer.