BLAST and Sequence Alignment

Post on 19-Mar-2017

Lecture 2
BLAST and Sequence Alignment
Phylogenetics and Sequence Analysis
Fall 2015
Kurt Wollenberg, PhD
Phylogenetics Specialist
Bioinformatics and Computational Biosciences Branch
Office of Cyber Infrastructure and Computational Biology

We Are BCBB!
Group of 37
  Bioinformatics Software Developers
  Computational Biologists
  Project Management & Analysis Professionals

Contact BCBB
NIH Users: Access a menu of BCBB services on the NIAID Intranet: http://bioinformatics.niaid.nih.gov/
Outside of NIH: search BCBB on the NIAID Public Internet Page: www.niaid.nih.gov
or use this direct link
Email us at: ScienceApps@niaid.nih.gov

Course Organization
  Building a clean sequence
  Collecting homologs
  Aligning your sequences
  Building trees
  Further analyses

Previously
  Hierarchical and genealogical data
  Comparative sequence analysis
  Generating clean sequence
    Trim vector contamination
    Trim low-quality ends
    Align fragment overlap to build contig
    Export contig (consensus)

Today
  Pairwise sequence alignment
    How does it work?
  BLAST
    How does it work?
    The many flavors of BLAST
    Demo
  Multiple Sequence Alignment
    How does it work?
    Demo
    Inspect and correct your MSA

BLAST: Basic Local Alignment Search Tool
Sequence Alignment: Assigning homology to sites among a group of known sequences
BLAST: Alignment of one sequence with many unknown sequences

PAIRWISE ALIGNMENT

HOMOLOGY vs. ANALOGY
  Homology: common ancestry
  Analogy: convergence

Pairing of sites based on an assessment of homology
Homology assessed using Substitution Matrices

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
           G+ +VK+HGKKV  A+++++AH+D++ +++++LS+LH  KL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
           ++ ++++H+ KV   + +A ++           +L+ L+++H+ K
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
           GS+ + G +   +D L  ++ H+ D+  A +AL D    ++AH+
F11G11.2   GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

Substitution Matrices
  Derived mathematically
  Derived from data
A substitution matrix (even one derived by arbitrarily assigning probabilities to pairs) is a statement of the probability of observing these pairs in real alignments.

DNA Substitution Matrices
Single parameter - Jukes-Cantor
  Equal base frequencies
  Uniform rates of change
Two parameter - Kimura
  Equal base probabilities
  Two rates of change
More parameters - HKY
  Unequal base frequencies
  Two rates of change
Fully parameterized - GTR
  Unequal base probabilities
  Six rates of change

Jukes-Cantor Substitution Probabilities

t = 0.25
      A       C       G       T
A  0.5259  0.1580  0.1580  0.1580
C  0.1580  0.5259  0.1580  0.1580
G  0.1580  0.1580  0.5259  0.1580
T  0.1580  0.1580  0.1580  0.5259
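
The table above can be reproduced with a few lines of Python. This is a minimal sketch, not taken from the slides: it assumes the common Jukes-Cantor parameterization in which each specific substitution occurs at rate alpha, and it assumes alpha = 1, which is the value that matches the tabulated numbers at t = 0.25. The function name jukes_cantor is hypothetical.

```python
import math

BASES = "ACGT"

def jukes_cantor(t, alpha=1.0):
    """Return the 4x4 Jukes-Cantor substitution probability matrix.

    Assumed parameterization: every specific change (e.g. A -> C) has
    rate alpha, so the chance of seeing the same base after time t is
    1/4 + 3/4 * exp(-4 * alpha * t).
    """
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

matrix = jukes_cantor(0.25)
for base, row in zip(BASES, matrix):
    print(base, " ".join(f"{p:.4f}" for p in row))
```

Each row sums to 1, and with a single rate every off-diagonal entry is the same.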


Kimura Two-Parameter Substitution Model
If the probability of transitions (A <-> G, C <-> T) is different from the probability of transversions (A <-> T, G <-> T, A <-> C, G <-> C), then there are two relative rate parameters, expressed as the transition/transversion rate ratio k.

Kimura Two-Parameter Substitution Probabilities

t = 0.25, k = 2.0
      A       C       G       T
A  0.4535  0.1580  0.2304  0.1580
C  0.1580  0.4535  0.1580  0.2304
G  0.2304  0.1580  0.4535  0.1580
T  0.1580  0.2304  0.1580  0.4535
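
The same kind of sketch works for the two-parameter table. The parameterization here is an assumption chosen because it reproduces the tabulated values: each transversion has rate beta = 1, and the single transition available to each base has rate k * beta. The function name kimura_2p is hypothetical.

```python
import math

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def kimura_2p(t, kappa, beta=1.0):
    """Return the 4x4 Kimura two-parameter probability matrix (order ACGT).

    Assumed parameterization: transversion rate beta per specific change,
    transition rate alpha = kappa * beta.
    """
    alpha = kappa * beta
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    same = 0.25 + 0.25 * e1 + 0.5 * e2
    transition = 0.25 + 0.25 * e1 - 0.5 * e2
    transversion = 0.25 - 0.25 * e1

    def prob(a, b):
        if a == b:
            return same
        return transition if (a, b) in TRANSITIONS else transversion

    return [[prob(a, b) for b in BASES] for a in BASES]

matrix = kimura_2p(0.25, 2.0)
for base, row in zip(BASES, matrix):
    print(base, " ".join(f"{p:.4f}" for p in row))
```

Setting kappa = 1 collapses the model back to Jukes-Cantor, since transitions and transversions then share one rate.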


HKY Substitution Probabilities

Substitution Models

Jukes-Cantor (one substitution type, equal nucleotide frequencies)
  + independent nucleotide frequencies -> F81/TN82
  + two substitution types -> Kimura 2-Parameter (K2P)
F81/TN82
  + two substitution types -> HKY85/F84
Kimura 2-Parameter (K2P)
  + independent nucleotide frequencies -> HKY85/F84
  + three substitution types -> Kimura 3 substitution type (K3ST)
HKY85/F84
  + three substitution types -> Tamura-Nei (TrN)
Kimura 3 substitution type (K3ST)
  + six substitution types -> Symmetric (SYM)
Tamura-Nei (TrN)
  + six substitution types -> General time-reversible (GTR)
Symmetric (SYM)
  + independent nucleotide frequencies -> General time-reversible (GTR)

Algorithm explained in Felsenstein (2004) Chapter 16.

Protein Score Matrices
Similarity of Amino Acids
From Esquivel RO, et al. 2013. Advances in Quantum Mechanics, Chapter 27. InTech.

Protein Score Matrices
  Derived from empirical data
  Account for depth of relationship among the data
  Expressed as log-odds ratio:
    Logarithm of the ratio of the probabilities of two residues being aligned due to homology versus random chance

Protein Score (Substitution) Matrices
The log-odds ratio:

$s(a,b) = \log \frac{p_{ab}}{q_a q_b}$

$q_a$ = frequency of residue a in the data
$p_{ab}$ = probability that residues a and b have been derived from a common ancestor
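
As a small worked example of the log-odds ratio, the sketch below computes a score from made-up numbers. The probabilities are illustrative only and do not come from PAM or BLOSUM data; the function name log_odds and the choice of base 2 (scores in bits) are assumptions.

```python
import math

def log_odds(p_ab, q_a, q_b, base=2.0):
    """Log-odds score s(a, b) = log(p_ab / (q_a * q_b)).

    p_ab: probability of seeing a aligned with b because of common ancestry.
    q_a, q_b: background frequencies of residues a and b.
    """
    return math.log(p_ab / (q_a * q_b), base)

# Hypothetical values: the pair is seen twice as often as chance predicts.
print(log_odds(0.02, 0.1, 0.1))
# Hypothetical values: the pair is seen half as often as chance predicts.
print(log_odds(0.005, 0.1, 0.1))
```

A pair observed more often than chance gets a positive score, and a pair observed less often gets a negative one, which is why conservative substitutions score above zero in the matrices that follow.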

Protein Substitution Matrices
PAM250: Based on phylogenies where all sequences differ by no more than 15%.
BLOSUM62: Based on clusters of sequences with greater than 62% identical residues.

Protein Substitution Matrices: PAM250 (matrix figure)

Protein Substitution Matrices: BLOSUM62 (matrix figure)

BLAST and Sequence Alignment
How do two sequences get aligned?
Global alignment (Needleman-Wunsch)
  Assign homology across the entire sequence
  Clustal
Local alignment (Smith-Waterman)
  Assign homology for subsequences
  MUSCLE and BLAST
  Good for aligning very divergent sequences

SEQUENCE ALIGNMENT

Build a matrix of score values for all site pairs
Sequences: HEAGAWGHEE and PAWHEAE

PAM250
     H   E   A   G   A   W   G   H   E   E
P    0  -1   1   0   1  -5   0   0  -1  -1
A   -1   0   2   1   2  -6   1  -1   0   0
W   -3  -7  -6  -7  -6  17  -7  -3  -7  -7
H    6   1  -1  -2  -1  -3  -2   6   1   1
E    1   4   0   0   0  -7   0   1   4   4
A   -1   0   2   1   2  -6   1  -1   0   0
E    1   4   0   0   0  -7   0   1   4   4

BLOSUM62
     H   E   A   G   A   W   G   H   E   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   -2  -1   4   0   4  -3   0  -2  -1  -1
W   -2  -3  -3  -2  -3  11  -2  -2  -3  -3
H    8   0  -2  -2  -2  -2  -2   8   0   0
E    0   5  -1  -2  -1  -3  -2   0   5   5
A   -2  -1   4   0   4  -3   0  -2  -1  -1
E    0   5  -1  -2  -1  -3  -2   0   5   5

What about gaps?
  Score penalty for opening
  Score penalty for extending
Penalties are log probabilities of a gap of a specific length

Standard gap costs

Substitution Matrix   Gap Costs (Open, Extend)
PAM30                 (9, 1)
PAM70                 (10, 1)
BLOSUM80              (10, 1)
BLOSUM62              (10, 1)
BLOSUM45              (15, 2)
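
To make the open and extend penalties concrete, here is a minimal sketch of an affine gap cost. The (open, extend) pairs are copied from the table above. The convention that a gap of length k costs open + k * extend is an assumption; some tools instead charge open + (k - 1) * extend, so check the documentation of the program you use. The names GAP_COSTS and gap_penalty are hypothetical.

```python
# (open, extend) pairs copied from the "Standard gap costs" table.
GAP_COSTS = {
    "PAM30": (9, 1),
    "PAM70": (10, 1),
    "BLOSUM80": (10, 1),
    "BLOSUM62": (10, 1),
    "BLOSUM45": (15, 2),
}

def gap_penalty(length, gap_open, gap_extend):
    """Cost of one gap of the given length.

    Assumed convention: opening is charged once and every gapped
    position is charged the extension cost.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length

for matrix, (gap_open, gap_extend) in GAP_COSTS.items():
    costs = [gap_penalty(k, gap_open, gap_extend) for k in (1, 2, 5)]
    print(matrix, costs)
```

Because opening costs much more than extending, one long gap is cheaper than several short ones of the same total length.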

Dynamic Programming: Calculate a matrix of alignment scores

BLOSUM62 scores
     H   E   A
P   -2  -1  -1
A   -2  -1   4
W   -2  -3  -3

Alignment scores
          H    E    A
     0   -8  -16  -24
P   -8   -2   -9  -17
A  -16  -10   -3   -5
W  -24  -18  -11   -6

Dynamic Programming
  Calculate a full matrix
  Traceback to get the Global Alignment

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -5  -13  -21  -29  -37  -45  -53  -61
W  -24  -18  -11   -6   -7  -15  -10  -18  -26  -34  -42
H  -32  -16  -18  -13   -8   -9  -17  -12  -10  -18  -26
E  -40  -24  -11  -19  -15   -9  -12  -19  -12   -5  -13
A  -48  -32  -19   -7  -15  -11  -12  -12  -20  -13   -6
E  -56  -40  -27  -15   -9  -16  -14  -14  -12  -15   -8

Global alignment:
H E A G A W G H E E
- - P - A W H E A E

Traceback path (scores, from the final cell back to the origin): -8, -13, -12, -12, -10, -21, -25, -17, -16, -8, 0
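
The full matrix and traceback can be sketched in Python as follows. This is an illustrative implementation, not code from the lecture. It assumes a linear gap penalty of -8 per gapped position (inferred from the first row and column of the matrix) and uses only the BLOSUM62 values shown on the earlier slide for these two sequences. The names BLOSUM62_SUBSET and needleman_wunsch are hypothetical.

```python
GAP = -8  # assumed linear gap penalty, inferred from the matrix borders

# BLOSUM62 values for the residues in this example, keyed [row][column].
BLOSUM62_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 4, "G": 0, "W": -3},
    "W": {"H": -2, "E": -3, "A": -3, "G": -2, "W": 11},
    "H": {"H": 8, "E": 0, "A": -2, "G": -2, "W": -2},
    "E": {"H": 0, "E": 5, "A": -1, "G": -2, "W": -3},
}

def needleman_wunsch(x, y, scores, gap):
    """Global alignment of x (columns) and y (rows); returns (score, x_aln, y_aln)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + scores[y[i - 1]][x[j - 1]],  # pair the residues
                F[i - 1][j] + gap,                             # gap in x
                F[i][j - 1] + gap,                             # gap in y
            )
    # Traceback from the bottom-right cell, preferring the diagonal move.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + scores[y[i - 1]][x[j - 1]]:
            top.append(x[j - 1]); bottom.append(y[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append("-"); bottom.append(y[i - 1]); i -= 1
        else:
            top.append(x[j - 1]); bottom.append("-"); j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))

score, x_aln, y_aln = needleman_wunsch("HEAGAWGHEE", "PAWHEAE", BLOSUM62_SUBSET, GAP)
print(score)
print(x_aln)
print(y_aln)
```

Under these assumptions the final cell holds -8. Where several moves tie, the order of the checks in the traceback decides which of the equally scoring alignments is reported.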

Local Alignment
  Alignment of subsequences
  Good for aligning very divergent sequences
Score Calculation
  Minimum score is zero
  Traceback begins at the highest score
  Score = 0 -> end of subsequence

Local Alignment

          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    4    0    4    0    0    0    0    0
W    0    0    0    0    2    0   15    7    0    0    0
H    0    8    0    0    0    0    7   13   15    7    0
E    0    0   13    5    0    0    0    5   13   20   12
A    0    0    5   17    9    4    0    0    5   12   19
E    0    0    5    9   15    8    1    0    0   10   17

Best local alignment (traceback scores 20, 15, 7, 15, 4):
A W G H E
A W - H E

Second highlighted path (scores 17, 13, 8): H E A aligned with H E A

Repeat Match (uppercase = aligned):
H E A G A W G H E e
p a w H E A e   p A W - H E a e

Overlap Match (uppercase = aligned):
H E A G A W G H E e
p A W - H E a e
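
The local alignment rules above (scores never drop below zero, traceback starts at the highest cell and stops at zero) can be sketched as follows. As with the global example, this is an illustrative implementation under assumptions: a linear gap penalty of -8 and the BLOSUM62 values shown earlier for these residues. The names BLOSUM62_SUBSET and smith_waterman are hypothetical.

```python
GAP = -8  # assumed linear gap penalty

# BLOSUM62 values for the residues in this example, keyed [row][column].
BLOSUM62_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 4, "G": 0, "W": -3},
    "W": {"H": -2, "E": -3, "A": -3, "G": -2, "W": 11},
    "H": {"H": 8, "E": 0, "A": -2, "G": -2, "W": -2},
    "E": {"H": 0, "E": 5, "A": -1, "G": -2, "W": -3},
}

def smith_waterman(x, y, scores, gap):
    """Best local alignment of x (columns) and y (rows); returns (score, x_aln, y_aln)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                0,                                             # minimum score is zero
                F[i - 1][j - 1] + scores[y[i - 1]][x[j - 1]],
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback starts at the highest score and ends when a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + scores[y[i - 1]][x[j - 1]]:
            top.append(x[j - 1]); bottom.append(y[i - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append("-"); bottom.append(y[i - 1]); i -= 1
        else:
            top.append(x[j - 1]); bottom.append("-"); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

score, x_aln, y_aln = smith_waterman("HEAGAWGHEE", "PAWHEAE", BLOSUM62_SUBSET, GAP)
print(score)
print(x_aln)
print(y_aln)
```

The only differences from the global version are the zero floor in the recurrence, the zero borders, and where the traceback starts and stops.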

Scoring alignments and expect values

Score := Value in the dynamic programming matrix where the traceback began.
Expect (E) value := Number of matches expected due to chance, with a score of at least S, based on a stochastic sequence model.
P value := Probability of finding at least one match with a score of at least S

$P = 1 - e^{-E(S)}$
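
The relationship between E and P is easy to check numerically. The sketch below simply evaluates the formula above; the function name p_value_from_e is hypothetical, and math.expm1 is used only to keep precision when E is very small.

```python
import math

def p_value_from_e(e_value):
    """P = 1 - exp(-E): chance of at least one match scoring at least S."""
    return -math.expm1(-e_value)

for e in (0.001, 0.01, 0.1, 1.0, 10.0):
    print(e, p_value_from_e(e))
```

For small E the two numbers are almost identical, which is why E values below about 0.01 are often read as if they were probabilities.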

BLAST (Basic Local Alignment Search Tool)
How does BLAST work?
  Create a list of query sequence words
    Word lengths: 11 nucleotides, 3 amino acids
  Create a list of neighborhood words
    Similar to query words and above a score threshold
  Search for matches in the database
  Extend matches
    Below threshold? Discard!
    Above threshold? Keep it!
  Format and output maximally extended matches

How does BLAST evaluate matches?
  It uses (local) alignment scores
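
The word-list, search, and extend steps can be illustrated with a toy seed-and-extend routine. This is a simplified sketch and not how the BLAST programs are implemented: it uses exact word matches only (no neighborhood words), ungapped extension, and match, mismatch, and drop-off values that are assumptions chosen for illustration. All function names are hypothetical.

```python
from collections import defaultdict

def index_words(query, w):
    """Map every length-w word in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index

def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    index = index_words(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds

def _extend(query, subject, i, j, step, match, mismatch, xdrop):
    """Walk from (i, j) in one direction; return (best score gain, length kept)."""
    best = run = 0
    best_len = k = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        run += match if query[i] == subject[j] else mismatch
        k += 1
        if run > best:
            best, best_len = run, k
        elif best - run > xdrop:
            break  # score fell too far below the best seen so far
        i += step
        j += step
    return best, best_len

def extend_seed(query, subject, i, j, w, match=1, mismatch=-3, xdrop=5):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, subject_start, length).
    """
    right, right_len = _extend(query, subject, i + w, j + w, 1, match, mismatch, xdrop)
    left, left_len = _extend(query, subject, i - 1, j - 1, -1, match, mismatch, xdrop)
    score = w * match + left + right
    return score, i - left_len, j - left_len, w + left_len + right_len

query, subject = "GATTACA", "CCGATTACATT"
seeds = find_seeds(query, subject, w=4)
hits = {extend_seed(query, subject, i, j, w=4) for i, j in seeds}
threshold = 6  # hypothetical score cutoff: keep hits at or above it
print([hit for hit in hits if hit[0] >= threshold])
```

Several seeds on the same diagonal extend to the same maximal match, so collecting the extended hits in a set reports it once.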

The Many Flavors of BLAST
  BLASTn and BLASTp
  short, nearly-exact match BLAST
  Translated BLAST
    BLASTx: nt -> aa query vs. protein db
    tBLASTn: aa query vs. protein db <- DNA db
    tBLASTx: nt -> aa query vs. protein db <- DNA db
  PSI-BLAST (Position-Specific Iterated BLAST)
  PHI-BLAST (Pattern Hit Initiated BLAST)
  bl2seq BLAST
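
The translated flavors depend on turning a nucleotide sequence into protein in all six reading frames (three on each strand). A minimal sketch of that step is shown below, assuming the standard genetic code and an unambiguous ACGT input; codons containing other characters are written as X. The function names are hypothetical and the sketch is not taken from the BLAST source.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse complement of an uppercase ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the translations of frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

for frame, protein in six_frames("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG").items():
    print(frame, protein)
```

BLASTx applies this kind of translation to the query, tBLASTn to the database, and tBLASTx to both, which is why the translated searches take longer than BLASTn or BLASTp.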

short, nearly-exact match BLAST
  Increase Expect threshold
  Reduce word size (7 for nt, 2 for aa)
  Turn off l
