pairwise sequence alignment exercise 2


Upload: manju

Post on 21-Jan-2016



TRANSCRIPT

Page 1: Pairwise Sequence Alignment Exercise 2

Pairwise Sequence Alignment

Exercise 2

Page 2: Pairwise Sequence Alignment Exercise 2

Motivation

MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA

Page 3: Pairwise Sequence Alignment Exercise 2

What is sequence alignment?

Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
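
As a minimal sketch of how the match line between two aligned sequences can be produced, the Python below marks identical positions with '|'. The function names are illustrative and not part of the slides; the code assumes the two strings are already aligned to equal length, with '-' standing for a gap.

```python
def match_line(seq_a: str, seq_b: str) -> str:
    """Return a string with '|' under every identical, non-gap column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(seq_a, seq_b)
    )


def identity_count(seq_a: str, seq_b: str) -> int:
    """Count the columns where both sequences carry the same residue."""
    return match_line(seq_a, seq_b).count("|")


# The two protein fragments shown on the slide.
top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"

print(top)
print(match_line(top, bottom))
print(bottom)
print(identity_count(top, bottom), "identical positions out of", len(top))
```

Only exact identities are marked here; marking "similar" residues would additionally need a substitution matrix.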

Page 4: Pairwise Sequence Alignment Exercise 2

Why sequence alignment?

Predict characteristics of a protein, premised on:

similar sequence (or structure) implies similar function

Page 5: Pairwise Sequence Alignment Exercise 2

Local vs. Global

Global alignment – finds the best alignment across the whole two sequences.

Local alignment – finds regions of high similarity in parts of the sequences.

Global alignment: forces alignment in regions which differ

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment: concentrates on regions of high similarity

ADLG     CDRYFQ
||||     |||| |
ADLG     CDRYYQ
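
The difference between the two modes can be sketched with the classic dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). The slides do not name these algorithms or show code; the scoring values below (match +1, mismatch -2, indel -1) are an assumption, and the functions return only the best score, not the alignment itself.

```python
def global_score(a: str, b: str, match=1, mismatch=-2, gap=-1) -> int:
    """Best score for aligning all of a against all of b (Needleman-Wunsch)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: only gaps so far
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a: str, b: str, match=1, mismatch=-2, gap=-1) -> int:
    """Best score over all pairs of substrings (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart instead of carrying a
            # negative prefix; this is what makes the alignment local.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# Ungapped versions of the two sequences from the slide.
x, y = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", global_score(x, y))
print("local: ", local_score(x, y))
```

Because the local score may ignore poorly matching flanks, it is never lower than the global score for the same pair and scoring scheme.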

Page 6: Pairwise Sequence Alignment Exercise 2

Evolutionary changes in sequences

Three types of changes:

1. Substitution – a replacement of one (or more) sequence letter by another:

   AAGA
   AACA

2. Insertion – an insertion of a letter or several letters to the sequence

3. Deletion – deleting a letter (or more) from the sequence

Insertion + Deletion = Indel

[Figure: example sequences for insertion and deletion; layout not recoverable from the transcript]

Page 7: Pairwise Sequence Alignment Exercise 2

Choosing an alignment:

Many different alignments are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

Which alignment is better?

Page 8: Pairwise Sequence Alignment Exercise 2

Exercise: compute both alignment scores

Match: +1    Mismatch: -2    Indel: -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
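
One way to check a hand-computed answer is a small column-by-column scorer. This is a sketch, not part of the original exercise: it uses the scoring scheme stated on the slide (match +1, mismatch -2, indel -1) and assumes each flattened alignment on the slide is the top row followed directly by the bottom row.

```python
def alignment_score(top: str, bottom: str,
                    match: int = 1, mismatch: int = -2, indel: int = -1) -> int:
    """Sum per-column scores of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += indel      # a gap in either row counts as one indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# The two candidate alignments from the exercise, split into rows.
alignment_a = ("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
alignment_b = ("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")

print("alignment A:", alignment_score(*alignment_a))
print("alignment B:", alignment_score(*alignment_b))
```

Running it after working the exercise by hand shows which of the two alignments the scheme prefers.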

Page 9: Pairwise Sequence Alignment Exercise 2

Scoring systems: accounting for biological context

Which is true about the scores in a pairwise alignment of nucleotide sequences?

1. Tr > Tv > 0

2. Tr < Tv < 0

3. 0 > Tr > Tv

4. 0 > Tv > Tr

Tr = Transition

Tv = Transversion
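
To make the terms concrete: a transition swaps a purine for a purine (A<->G) or a pyrimidine for a pyrimidine (C<->T), while a transversion swaps a purine for a pyrimidine or vice versa. The sketch below classifies a substitution and scores it. The numeric values are assumptions chosen for illustration (the slides give none); they follow the common convention that transitions, being more frequent, are penalised less than transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x: str, y: str) -> str:
    """Classify a pair of nucleotides as match, transition or transversion."""
    x, y = x.upper(), y.upper()
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        raise ValueError("expected one of A, C, G, T")
    if x == y:
        return "match"
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


# Illustrative values only; real scoring schemes differ.
ASSUMED_SCORES = {"match": 1, "transition": -1, "transversion": -2}


def nucleotide_score(x: str, y: str) -> int:
    """Score one aligned column under the assumed scheme above."""
    return ASSUMED_SCORES[substitution_type(x, y)]


for pair in ("AG", "CT", "AC", "GT"):
    print(pair, substitution_type(*pair), nucleotide_score(*pair))
```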

Page 10: Pairwise Sequence Alignment Exercise 2

Scoring systems: accounting for biological context

Which is true about the scores in a pairwise alignment of amino-acid sequences?

1. Asp->Asn > Asp->Glu

2. Arg->His > Ala->Phe

3. Arg->His < Thr->Met
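
The three statements can be checked against a substitution matrix. The sketch below hard-codes the five relevant BLOSUM62 entries so that it runs without dependencies. The slides do not say which matrix the question assumes, so BLOSUM62 is an assumption here, and the values should be verified against a trusted copy of the matrix before relying on them.

```python
# A handful of BLOSUM62 entries, typed in by hand for illustration only.
# Verify against a full copy of the matrix, e.g. Biopython's
# Bio.Align.substitution_matrices.load("BLOSUM62").
BLOSUM62_SUBSET = {
    frozenset("DN"): 1,    # Asp <-> Asn
    frozenset("DE"): 2,    # Asp <-> Glu
    frozenset("RH"): 0,    # Arg <-> His
    frozenset("AF"): -2,   # Ala <-> Phe
    frozenset("TM"): -1,   # Thr <-> Met
}


def blosum62(a: str, b: str) -> int:
    """Look up a symmetric substitution score from the subset above."""
    return BLOSUM62_SUBSET[frozenset((a, b))]


statements = {
    "1. Asp->Asn > Asp->Glu": blosum62("D", "N") > blosum62("D", "E"),
    "2. Arg->His > Ala->Phe": blosum62("R", "H") > blosum62("A", "F"),
    "3. Arg->His < Thr->Met": blosum62("R", "H") < blosum62("T", "M"),
}
for text, holds in statements.items():
    print(text, "->", holds)
```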

Page 11: Pairwise Sequence Alignment Exercise 2

Substitution matrices

Nucleic acids:
  Transition-transversion

Amino acids:
  Evolutionary (empirical data) based: PAM, BLOSUM
  Physico-chemical properties based: Grantham, McLachlan

Page 12: Pairwise Sequence Alignment Exercise 2

PAM matrices

Family of matrices: PAM 80, PAM 120, PAM 250

The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based

Greater numbers denote greater distances

Page 13: Pairwise Sequence Alignment Exercise 2

PAM - limitations

Based on only one original dataset

Examines proteins with few differences (85% identity)

Based mainly on small globular proteins, so the matrix is biased

Page 14: Pairwise Sequence Alignment Exercise 2

BLOSUM matrices

Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments)

BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity

BLOSUM62 represents closer sequences than BLOSUM45

Page 15: Pairwise Sequence Alignment Exercise 2

Substitution matrices exercise

Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment:

Human – chimp
Human – yeast
Human – fish

PAM options: PAM60 PAM120 PAM250

BLOSUM options: BLOSUM45 BLOSUM62 BLOSUM80

Page 16: Pairwise Sequence Alignment Exercise 2

PAM vs. BLOSUM

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

(Further down the list: more distant sequences)

BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations

PAM120 for general use
PAM60 for close relations
PAM250 for distant relations

Page 17: Pairwise Sequence Alignment Exercise 2

Gap penalty

AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC
AGG---AACTCGAAC

• Which alignment is more likely?

• Which alignment has a higher score?
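
The two questions can give different answers depending on how gaps are charged. A common approach is an affine gap penalty, where opening a gap costs more than extending one, so a single long indel is cheaper than several scattered ones. The sketch below is illustrative only: the slides give no scoring values for this example, so the match, mismatch, gap-open and gap-extend numbers are assumptions.

```python
def affine_score(top: str, bottom: str, match: int = 1, mismatch: int = -2,
                 gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score an alignment where a gap of length L costs
    gap_open + (L - 1) * gap_extend."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    prev_gap = None  # row that carried a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension;
            # anything else opens a new gap.
            score += gap_extend if row == prev_gap else gap_open
            prev_gap = row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


scattered = ("AAGCGAAATTCGAAC", "A-G-GAA-CTCGAAC")   # three 1-letter gaps
contiguous = ("AAGCGAAATTCGAAC", "AGG---AACTCGAAC")  # one 3-letter gap

for name, aln in (("scattered", scattered), ("contiguous", contiguous)):
    linear = affine_score(*aln, gap_open=-1, gap_extend=-1)  # plain per-letter indel
    affine = affine_score(*aln)
    print(f"{name:10s} linear={linear:3d} affine={affine:3d}")
```

With a flat per-letter indel cost the alignment with scattered gaps scores higher, while the affine scheme favours the single contiguous gap, which corresponds to one insertion or deletion event.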

Page 18: Pairwise Sequence Alignment Exercise 2

Web servers for pairwise alignment

Page 19: Pairwise Sequence Alignment Exercise 2

BLAST 2 sequences (bl2seq) at NCBI

Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment

Does not use an exact algorithm but a heuristic

Page 20: Pairwise Sequence Alignment Exercise 2

Back to NCBI

Page 21: Pairwise Sequence Alignment Exercise 2

BLAST – bl2seq

Page 22: Pairwise Sequence Alignment Exercise 2

Bl2seq - query

blastn – nucleotide
blastp – protein

Page 23: Pairwise Sequence Alignment Exercise 2

Bl2seq results

Page 24: Pairwise Sequence Alignment Exercise 2

Bl2seq results

Match
Dissimilarity
Gaps
Similarity
Low complexity

Page 25: Pairwise Sequence Alignment Exercise 2

BLAST – programs

Query: DNA Protein

Database: DNA Protein

Page 26: Pairwise Sequence Alignment Exercise 2

BLAST – Blastp

Page 27: Pairwise Sequence Alignment Exercise 2

Blastp - results

Page 28: Pairwise Sequence Alignment Exercise 2

Blastp – results (cont.)

Page 29: Pairwise Sequence Alignment Exercise 2

BLAST scores:

Bit score – A normalized score for the alignment, derived from the raw score (identities, similarities, gaps), so that scores from different searches can be compared.

Expected score (E-value) – The number of alignments with a score at least this high that one can "expect" to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.
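
The link between the two scores can be written as E = m × n × 2^(-S'), where S' is the bit score and m and n are the query length and the database length. The function below is a minimal sketch of that relation. It is not how the BLAST programs compute their reported values in full (they use effective lengths with edge corrections), and the function name and example numbers are made up for illustration.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: search-space size scaled down by 2**bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 150-residue query against 10 million residues.
for bits in (20, 40, 60):
    print(bits, "bits ->", expected_hits(bits, 150, 10_000_000))
```

Each additional bit halves the E-value, and searching a database twice as large doubles it, which is why the same alignment can be significant in a small database and not in a large one.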

Page 30: Pairwise Sequence Alignment Exercise 2

Blastp – acquiring sequences

Page 31: Pairwise Sequence Alignment Exercise 2

Blastp – acquiring sequences (cont.)

Page 32: Pairwise Sequence Alignment Exercise 2

Fasta format – multiple sequences

>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH

>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH

>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH

>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
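
In a multi-sequence FASTA file each record starts with a '>' header line, and the sequence continues over the following lines until the next '>'. A minimal parser can be sketched in plain Python as below. The function name and the two demo records are made up for illustration; for real work a library parser such as Biopython's SeqIO is the more robust choice.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # blank lines between records are ignored
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record input, not taken from a database.
demo = """>seq1 first demo record
MVHLTPEEK
TAVNALWGK

>seq2 second demo record
MGHFTEEDK
"""

for name, seq in parse_fasta(demo):
    print(name, len(seq), seq)
```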

Page 33: Pairwise Sequence Alignment Exercise 2

Searching for remote homologs

Sometimes BLAST isn't enough

Large protein family, and BLAST only finds close members. We want more distant members

PSI-BLAST

Page 34: Pairwise Sequence Alignment Exercise 2

PSI-BLAST

Position Specific Iterated BLAST

Regular blast

Construct profile from blast results

Blast profile search

Final results
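
The "construct profile" step can be illustrated with a toy position-specific scoring matrix: for each alignment column, count how often each amino acid occurs among the hits and convert the frequencies to log-odds scores. This is a heavily simplified sketch and not PSI-BLAST's actual procedure, which among other things weights sequences and uses more elaborate pseudocounts. The function names, the uniform background, the flat pseudocount and the demo sequences are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background."""
    length = len(aligned_hits[0])
    if any(len(s) != length for s in aligned_hits):
        raise ValueError("hits must be aligned to the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # skips gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, seq):
    """Score an ungapped sequence of the same length against the profile."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, seq))


# Hypothetical aligned hits from a first search round.
hits = ["MVHLT", "MVHFT", "MGHFT"]
profile = build_profile(hits)
print(profile_score(profile, "MVHFT"))   # resembles the family
print(profile_score(profile, "WWWWW"))   # does not
```

In the next iteration the database would be searched with this profile instead of the single query sequence, which is what lets more distant family members score well.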

Page 35: Pairwise Sequence Alignment Exercise 2

PSI-BLAST

Advantage: PSI-BLAST looks for sequences that are close to the query, and learns from them to extend the circle of friends

Disadvantage: if we obtain a WRONG hit, we will be led to unrelated sequences (contamination). This gets worse and worse with each iteration

Page 36: Pairwise Sequence Alignment Exercise 2

PSI-BLAST

Which one(s) of the following is/are correct?

1. PSI-BLAST is expected to give more hits than BLAST
2. PSI-BLAST is an iterative search method
3. PSI-BLAST is faster than BLAST
4. Each iteration of PSI-BLAST can only improve the results of the previous iteration

Page 37: Pairwise Sequence Alignment Exercise 2

BLAST – PSI-Blast

Page 38: Pairwise Sequence Alignment Exercise 2

PSI-Blast - results