pairwise and multiple sequence alignment lesson 2


Post on 20-Jan-2016


TRANSCRIPT

Page 1: Pairwise and Multiple Sequence Alignment Lesson 2

Pairwise and Multiple Sequence Alignment

Lesson 2

Page 2: Pairwise and Multiple Sequence Alignment Lesson 2

|| ||  ||||| ||| || |   ||||||||||||||||||||
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…

ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

Motivation

Page 3: Pairwise and Multiple Sequence Alignment Lesson 2

What is sequence alignment?

Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
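The row of bars between the two sequences can be produced mechanically. The sketch below is a minimal illustration, assuming the two rows are already aligned (equal length, `-` for gaps) and that only identical residues are marked; `match_line` is a name introduced here, not something from the slides.

```python
def match_line(row1: str, row2: str) -> str:
    """Return '|' under columns with identical residues, ' ' elsewhere."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(row1, row2)
    )


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"

print(top)
print(match_line(top, bottom))
print(bottom)
```

Marking similar (not only identical) residues would need a substitution matrix, which is introduced later in the lesson.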

Page 4: Pairwise and Multiple Sequence Alignment Lesson 2

Why perform a pairwise sequence alignment?

Finding homology between two sequences

e.g., predicting characteristics of a protein, premised on:

similar sequence (or structure) → similar function

Page 5: Pairwise and Multiple Sequence Alignment Lesson 2

Local vs. Global

Local alignment – finds regions of high similarity in parts of the sequences

Global alignment – finds the best alignment across the entire two sequences

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN CDRYYQ
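The difference between the two modes shows up directly in how the score table is filled. The sketch below is one common way to express it, assuming a simple match/mismatch/gap scheme (the values are placeholders): the global score is read from the last cell, while the local score resets negative cells to zero and takes the best cell anywhere in the table. The function name `alignment_score` is introduced here for illustration.

```python
def alignment_score(s1, s2, local=False, match=1, mismatch=-1, gap=-2):
    """Best global alignment score, or best local score when local=True."""
    rows, cols = len(s1) + 1, len(s2) + 1
    v = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalized.
        for i in range(1, rows):
            v[i][0] = i * gap
        for j in range(1, cols):
            v[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cell = max(v[i - 1][j - 1] + sub, v[i - 1][j] + gap, v[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may start anywhere
                best = max(best, cell)   # ... and end anywhere
            v[i][j] = cell
    return best if local else v[-1][-1]


print(alignment_score("TTACGTAA", "GGACGTCC"))              # global
print(alignment_score("TTACGTAA", "GGACGTCC", local=True))  # local
```

In the example the shared core ACGT dominates the local score, while the global score is also charged for the unrelated flanks.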

Page 6: Pairwise and Multiple Sequence Alignment Lesson 2

Evolutionary changes in sequences

Three types of nucleotide changes:

1. Substitution – a replacement of one (or more) sequence characters by another

2. Insertion – an insertion of one (or more) sequence characters

3. Deletion – a deletion of one (or more) sequence characters

Insertion + Deletion → Indel

[Slide examples: AAGA → AACA (substitution); fragments AAG, GAAA, T, A (insertion and deletion)]

Page 7: Pairwise and Multiple Sequence Alignment Lesson 2

Choosing an alignment:

Many different alignments between two sequences are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

. . .

How do we determine which is the best alignment?
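To get a feel for how many alignments are possible, one can count the paths through the alignment grid, where each step is a residue-residue column, a gap in the first sequence, or a gap in the second. The sketch below is an illustration under that counting convention (alignments that differ only in the order of an adjacent insertion and deletion are counted separately); `count_alignments` is a name introduced here.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of gapped alignments of sequences of length m and n."""
    if m == 0 or n == 0:
        return 1  # only one way left: gaps against the remaining residues
    return (
        count_alignments(m - 1, n - 1)  # last column pairs two residues
        + count_alignments(m - 1, n)    # last column: residue over gap
        + count_alignments(m, n - 1)    # last column: gap over residue
    )


for length in (1, 2, 3, 14):
    print(length, count_alignments(length, length))
```

The count grows so quickly with sequence length that enumerating all alignments is not an option, which motivates both a scoring scheme and an efficient search.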

Page 8: Pairwise and Multiple Sequence Alignment Lesson 2

Toy exercise

Compute the scores of each of the following alignments using this naïve scoring scheme

Scoring scheme: Match: +1   Mismatch: -2   Indel: -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Substitution matrix:

     A   C   G   T
A    1  -2  -2  -2
C   -2   1  -2  -2
G   -2  -2   1  -2
T   -2  -2  -2   1

Gap penalty (opening = extending)
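The exercise can be checked with a few lines of code. This is a minimal sketch of column-by-column scoring under the scheme above (match +1, mismatch -2, indel -1, with gap opening equal to gap extension); the function name and the way the two fused slide rows are split into equal halves are assumptions made here.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, indel=-1):
    """Sum the per-column scores of two aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


first = ("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = ("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")

print(score_alignment(*first))
print(score_alignment(*second))
```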

Page 9: Pairwise and Multiple Sequence Alignment Lesson 2

Substitution matrices: accounting for biological context

Which best reflects the biological reality regarding nucleotide mismatch penalty?

1. Tr > Tv > 0

2. Tv > Tr > 0

3. 0 > Tr > Tv

4. 0 > Tv > Tr

Tr = Transition

Tv = Transversion
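For reference when thinking about the question, a transition swaps two purines (A, G) or two pyrimidines (C, T), and a transversion swaps a purine with a pyrimidine. Transitions are generally observed more often than transversions, which is why scoring schemes commonly penalize them less. The sketch below classifies a substitution; the penalty values in `mismatch_score` are illustrative assumptions only, not values from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Return 'match', 'transition' or 'transversion' for two nucleotides."""
    a, b = a.upper(), b.upper()
    if not {a, b} <= PURINES | PYRIMIDINES:
        raise ValueError("expected nucleotides A, C, G or T")
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def mismatch_score(a: str, b: str, transition=-1, transversion=-2, match=1):
    """Hypothetical scores: transversions are penalized more than transitions."""
    kind = classify_substitution(a, b)
    return {"match": match, "transition": transition, "transversion": transversion}[kind]


print(classify_substitution("A", "G"), mismatch_score("A", "G"))
print(classify_substitution("A", "C"), mismatch_score("A", "C"))
```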

Page 10: Pairwise and Multiple Sequence Alignment Lesson 2

Scoring schemes: accounting for biological context

Which best reflects the biological reality regarding these mismatch penalties?

1. Arg->Lys > Ala->Phe

2. Arg->Lys > Thr->Asp

3. Asp->Val > Asp->Glu

Page 11: Pairwise and Multiple Sequence Alignment Lesson 2

PAM matrices

Family of matrices: PAM 80, PAM 120, PAM 250, …

The number with a PAM matrix (the n in PAMn) represents the evolutionary distance between the sequences on which the matrix is based

The (i-th, j-th) cell in a PAMn matrix denotes the probability that amino acid i will be replaced by amino acid j in time n: $P_{i \to j,\,n}$

Greater n numbers denote greater distances
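In the Dayhoff model behind PAM, the probabilities for distance n are obtained by multiplying the PAM1 transition matrix by itself n times. The sketch below illustrates that idea on a made-up two-letter alphabet; the numbers in `pam1_toy` are invented for the example and are not real PAM values.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each row holds the replacement probabilities for one letter.
pam1_toy = [[0.99, 0.01],
            [0.02, 0.98]]

for n in (1, 80, 250):
    p = mat_pow(pam1_toy, n)
    print(n, round(p[0][1], 3))  # probability that letter 0 is replaced by letter 1
```

The replacement probability rises with n, which matches the statement that greater n denotes greater evolutionary distance.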

Page 12: Pairwise and Multiple Sequence Alignment Lesson 2

PAM - limitations

Based on only one original dataset

Examines proteins with few differences (85% identity)

Based mainly on small globular proteins, so the matrix is biased

Page 13: Pairwise and Multiple Sequence Alignment Lesson 2

BLOSUM matrices

Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped, manually created local alignments)

BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity

The (i-th, j-th) cell in a BLOSUM matrix denotes the log of odds of the observed frequency and expected frequency of amino acids i and j in the same position in the data: $\log\left(\frac{P_{ij}}{q_i \cdot q_j}\right)$

Higher n numbers denote higher identity between the sequences on which the matrix is based
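A small numeric example makes the log-odds idea concrete. The sketch below follows the formula as written on the slide; the frequencies are invented for illustration, and base 2 is an assumption (published BLOSUM tables also scale and round the values, which is omitted here).

```python
import math


def log_odds(p_ij: float, q_i: float, q_j: float, base: float = 2.0) -> float:
    """log(observed pair frequency / frequency expected by chance)."""
    return math.log(p_ij / (q_i * q_j), base)


# Hypothetical frequencies: both residues have background frequency 0.1,
# so the pair is expected by chance with frequency 0.01.
print(round(log_odds(0.02, 0.1, 0.1), 2))   # seen twice as often as expected
print(round(log_odds(0.005, 0.1, 0.1), 2))  # seen half as often as expected
```

Pairs observed more often than chance get positive scores and pairs observed less often get negative scores.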

Page 14: Pairwise and Multiple Sequence Alignment Lesson 2

PAM vs. BLOSUM

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

(More distant sequences toward the bottom of the list)

BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations

PAM120 for general use
PAM60 for close relations
PAM250 for distant relations

Page 15: Pairwise and Multiple Sequence Alignment Lesson 2

Substitution matrices exercise

Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment:

Human – chimp
Human – yeast
Human – fish

PAM options: PAM60 PAM120 PAM250

BLOSUM options: BLOSUM45 BLOSUM62 BLOSUM80

Page 16: Pairwise and Multiple Sequence Alignment Lesson 2

Substitution matrices

Nucleic acids: Transition-transversion

Amino acids:
Evolutionary (empirical data) based: (PAM, BLOSUM)
Physico-chemical properties based: (Grantham, McLachlan)

Page 17: Pairwise and Multiple Sequence Alignment Lesson 2

Gap penalty

AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC
AGG---AACTCGAAC

• Which alignment has a higher score?

• Which alignment is more likely?
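One way to explore the two questions is to score both alignments twice: once with a linear gap penalty (every gap character costs the same) and once with an affine penalty (opening a gap costs more than extending it). The sketch below is an illustration only; the penalty values and the name `gapped_score` are assumptions, not values from the slides.

```python
def gapped_score(row1, row2, match=1, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Score two aligned rows; a gap run costs gap_open + (length-1)*gap_extend."""
    score = 0
    prev_gap_row = None  # which row (1 or 2) had a gap in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


scattered = ("AAGCGAAATTCGAAC", "A-G-GAA-CTCGAAC")  # three separate gaps
single = ("AAGCGAAATTCGAAC", "AGG---AACTCGAAC")     # one gap of length three

for label, rows in (("scattered", scattered), ("single", single)):
    linear = gapped_score(*rows)                              # open = extend = -2
    affine = gapped_score(*rows, gap_open=-5, gap_extend=-1)  # hypothetical values
    print(label, linear, affine)
```

With these illustrative values the alignment with three scattered gaps wins under the linear penalty (4 versus 2), while the alignment with a single three-residue gap wins under the affine penalty (1 versus -5), reflecting the view that one longer indel event is more plausible than several independent ones.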

Page 18: Pairwise and Multiple Sequence Alignment Lesson 2

Pairwise alignment algorithm matrix representation: formulation

2 sequences: S1 and S2 and a scoring scheme: match = 1, mismatch = -1, gap = -2

V[i,j] = value of the optimal alignment between S1[1…i] and S2[1…j]

$$
V[i+1,j+1] = \max \begin{cases}
V[i,j] + S(S_1[i+1], S_2[j+1]) \\
V[i+1,j] + S(\text{gap}) \\
V[i,j+1] + S(\text{gap})
\end{cases}
$$

V[i,j]      V[i+1,j]
V[i,j+1]    V[i+1,j+1]

Page 19: Pairwise and Multiple Sequence Alignment Lesson 2

Pairwise alignment algorithm matrix representation: initialization

Scoring scheme: Match = 1, Mismatch = -1, Indel (gap) = -2

S1\S2          A    G    C
          0    1    2    3
     0    0   -2   -4   -6
A    1   -2
A    2   -4
A    3   -6
C    4   -8

Page 20: Pairwise and Multiple Sequence Alignment Lesson 2

Pairwise alignment algorithm matrix representation: filling the matrix

Scoring scheme: Match = 1, Mismatch = -1, Indel (gap) = -2

S1\S2          A    G    C
          0    1    2    3
     0    0   -2   -4   -6
A    1   -2    1   -1   -3
A    2   -4   -1    0   -2
A    3   -6   -3   -2   -1
C    4   -8   -5   -4   -1

Page 21: Pairwise and Multiple Sequence Alignment Lesson 2

Pairwise alignment algorithm matrix representation: trace back

S1\S2          A    G    C
          0    1    2    3
     0    0   -2   -4   -6
A    1   -2    1   -1   -3
A    2   -4   -1    0   -2
A    3   -6   -3   -2   -1
C    4   -8   -5   -4   -1

Page 22: Pairwise and Multiple Sequence Alignment Lesson 2

Pairwise alignment algorithm matrix representation: trace back

S1\S2          A    G    C
          0    1    2    3
     0    0   -2   -4   -6
A    1   -2    1   -1   -3
A    2   -4   -1    0   -2
A    3   -6   -3   -2   -1
C    4   -8   -5   -4   -1

Resulting alignment:

AAAC
AG-C
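The initialization, fill and trace-back steps can be sketched in code as follows. This is a minimal illustration of the dynamic-programming scheme with the slide's scores (match = 1, mismatch = -1, gap = -2), with S1 = AAAC along the rows and S2 = AGC along the columns. Several alignments share the optimal score; the tie-breaking order in the trace-back (a gap in S2 is tried first) is a choice made here so that the output matches the alignment shown on the slide.

```python
def global_align(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score matrix, aligned s1, aligned s2) for a global alignment."""
    rows, cols = len(s1) + 1, len(s2) + 1
    v = [[0] * cols for _ in range(rows)]
    # Initialization: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        v[i][0] = i * gap
    for j in range(1, cols):
        v[0][j] = j * gap
    # Fill: each cell takes the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(v[i - 1][j - 1] + sub, v[i - 1][j] + gap, v[i][j - 1] + gap)
    # Trace back from the bottom-right cell to the top-left cell.
    out1, out2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and v[i][j] == v[i - 1][j] + gap:
            out1.append(s1[i - 1]); out2.append("-"); i -= 1
        elif i > 0 and j > 0 and v[i][j] == v[i - 1][j - 1] + (
            match if s1[i - 1] == s2[j - 1] else mismatch
        ):
            out1.append(s1[i - 1]); out2.append(s2[j - 1]); i -= 1; j -= 1
        else:
            out1.append("-"); out2.append(s2[j - 1]); j -= 1
    return v, "".join(reversed(out1)), "".join(reversed(out2))


matrix, row1, row2 = global_align("AAAC", "AGC")
for line in matrix:
    print(line)
print(row1)
print(row2)
```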

Page 23: Pairwise and Multiple Sequence Alignment Lesson 2

Assessing the significance of an alignment score

True:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCTGAATTC-GAA
AGGCTCATTTCTGA-

Random (shuffled sequences):

AGATCAGTAGACTA
GAGTAGCTATCTCT

AGATCAGTAGACTA-----
----GAGTAG-CTATCTCT

CGATAGATAGCATA
GCATGTCATGATTC

CGATAGATAGCATA---------
---------GCATGTCATGATTC

. . .

Scores shown on the slide: 28.0, 26.0, 16.0
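The comparison between a true alignment score and scores of shuffled sequences can be sketched as a simple permutation test. The code below is an illustration under stated assumptions: a basic global alignment score with match = 1, mismatch = -1, gap = -2, shuffling only the second sequence, and an empirical p-value with a +1 correction. The function names, the number of shuffles and the seed are all choices made here.

```python
import random


def global_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score, keeping only one row of the table."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * gap]
        for j in range(1, len(s2) + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_p_value(s1, s2, n_shuffles=200, seed=0):
    """Fraction of shuffled versions of s2 scoring at least as well as s2."""
    rng = random.Random(seed)
    observed = global_score(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        if global_score(s1, "".join(letters)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)


score, p = shuffle_p_value("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
print(score, round(p, 3))
```

A small p-value means that few shuffled sequences reach the observed score, so the observed similarity is unlikely to be a product of composition alone.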

Page 24: Pairwise and Multiple Sequence Alignment Lesson 2

Web servers for pairwise alignment

Page 25: Pairwise and Multiple Sequence Alignment Lesson 2

BLAST 2 sequences (bl2seq) at NCBI

Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment

Does not use an exact algorithm but a heuristic

Page 26: Pairwise and Multiple Sequence Alignment Lesson 2

Back to NCBI

Page 27: Pairwise and Multiple Sequence Alignment Lesson 2

BLAST – bl2seq

Page 28: Pairwise and Multiple Sequence Alignment Lesson 2

Bl2seq - query

blastn – nucleotide
blastp – protein

Page 29: Pairwise and Multiple Sequence Alignment Lesson 2

Bl2seq results

Page 30: Pairwise and Multiple Sequence Alignment Lesson 2

Bl2seq results

[Figure legend: Match, Dissimilarity, Similarity, Gaps, Low complexity]

Page 31: Pairwise and Multiple Sequence Alignment Lesson 2

Query type: AA or DNA?

For coding sequences, AA (protein) data are better

Selection operates most strongly at the protein level → the homology is more evident

AA – 20-character alphabet
DNA – 4-character alphabet

lower chance of random (spurious) matches for AA

Page 32: Pairwise and Multiple Sequence Alignment Lesson 2

BLAST – programs

Query: DNA Protein

Database: DNA Protein

Page 33: Pairwise and Multiple Sequence Alignment Lesson 2

BLAST – Blastp

Page 34: Pairwise and Multiple Sequence Alignment Lesson 2

Blastp - results

Page 35: Pairwise and Multiple Sequence Alignment Lesson 2

Blastp – results (cont.)

Page 36: Pairwise and Multiple Sequence Alignment Lesson 2

Blast scores:

Bits score – A score for the alignment according to the number of similarities, identities, etc. It has a standard set of units and is thus independent of the scoring scheme

Expected-score (E-value) – The number of alignments with the same or higher score one can "expect" to see by chance when searching a random database with a random sequence of particular sizes. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog
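The relation between the two scores can be written down with the standard Karlin-Altschul statistics: the bit score normalizes a raw score S as S' = (λS − ln K) / ln 2, and the E-value for a query of length m and a database of length n is E = m · n · 2^(−S'). The sketch below applies these formulas; the values used for λ, K and the lengths are illustrative placeholders chosen here, since the real parameters depend on the scoring matrix and gap penalties.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score into bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


# Placeholder statistical parameters and search-space sizes (assumptions).
LAMBDA, K = 0.267, 0.041
QUERY_LEN, DB_LEN = 147, 50_000_000

for raw in (50, 100, 200):
    bits = bit_score(raw, LAMBDA, K)
    print(raw, round(bits, 1), e_value(bits, QUERY_LEN, DB_LEN))
```

Each additional bit halves the E-value, and a larger database raises the E-value for the same bit score.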

Page 37: Pairwise and Multiple Sequence Alignment Lesson 2

Multiple Sequence Alignment (MSA)

Page 38: Pairwise and Multiple Sequence Alignment Lesson 2

Multiple sequence alignment

Similar to pairwise alignment BUT n sequences are aligned instead of just 2

Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--

Each row represents an individual sequence
Each column represents the 'same' position

Page 39: Pairwise and Multiple Sequence Alignment Lesson 2

Why perform an MSA?

MSAs are at the heart of comparative genomics studies which seek to study evolutionary histories, functional and structural aspects of sequences, and to understand phenotypic differences between species

Page 40: Pairwise and Multiple Sequence Alignment Lesson 2

Multiple sequence alignment

Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--

[Slide highlights: variable columns and conserved columns]
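Which columns are variable and which are conserved can be measured per column. The sketch below uses a deliberately simple measure, the fraction of rows that carry the most common residue in a column (gaps never count as a residue); this measure and the function name are assumptions made here, and real conservation views typically also account for similar residues.

```python
from collections import Counter

msa = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def column_conservation(rows):
    """For each column, the fraction of rows holding the most common residue."""
    scores = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


scores = column_conservation(msa)
conserved = [pos + 1 for pos, s in enumerate(scores) if s == 1.0]  # 1-based
print(conserved)
```

In this alignment only the cysteine column and the tryptophan column are identical in all eight sequences.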

Page 41: Pairwise and Multiple Sequence Alignment Lesson 2

Alignment methods

No practical algorithm guarantees the optimal MSA for realistic numbers of sequences, so all commonly used methods are heuristics:

Progressive/hierarchical alignment (ClustalX)

Iterative alignment (MAFFT, MUSCLE)

Page 42: Pairwise and Multiple Sequence Alignment Lesson 2

Progressive alignment

First step: compute pairwise distances

Input sequences: A, B, C, D, E

Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table

     A    B    C    D    E
A
B    8
C   15   17
D   16   14   10
E   32   31   31   32

Page 43: Pairwise and Multiple Sequence Alignment Lesson 2

Second step: build a guide tree

Cluster the sequences to create a tree (guide tree):

• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree

     A    B    C    D    E
A
B    8
C   15   17
D   16   14   10
E   32   31   31   32

[Figure: guide tree over sequences A, B, C, D, E]

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
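As an illustration of the clustering step, the sketch below builds a guide tree from the distance table with UPGMA (repeatedly merging the two closest clusters and averaging their distances to the rest). The slides do not say which clustering method is used, so UPGMA is an assumption made here; the nested-tuple tree representation is also just a convenient choice.

```python
from itertools import combinations

# Distance table from the slide, keyed by sequence pair.
table = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}


def upgma(names, distances):
    """Return a guide tree as nested tuples, merging closest clusters first."""
    d = {frozenset(pair): value for pair, value in distances.items()}
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for other in clusters:
            # Size-weighted average of the distances to the merged members.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


tree = upgma("ABCDE", table)
print(tree)
```

With this table A and B are joined first (distance 8), then C and D (distance 10), then the two pairs, and finally E.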

Page 44: Pairwise and Multiple Sequence Alignment Lesson 2

Third step: align sequences in a bottom up order

1. Align the most similar (neighboring) pairs

2. Align pairs of pairs

3. Align sequences clustered to pairs of pairs deeper in the tree

[Figure: guide tree over A, B, C, D, E next to the rows Sequence A to Sequence E]
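The bottom-up order can be read off a guide tree with a post-order traversal: both children of a node are aligned before the node itself. The sketch below only lists the order of the alignment steps and does not perform any alignment; the tree literal is the topology suggested by the three steps above and is an assumption made for the example.

```python
def merge_order(tree):
    """List the (left, right) groups to align, from the leaves toward the root."""
    steps = []

    def visit(node):
        if isinstance(node, str):          # a leaf: a single sequence
            return node
        left, right = visit(node[0]), visit(node[1])
        steps.append((left, right))        # children are ready, align them now
        return left + right                # label of the merged group

    visit(tree)
    return steps


guide_tree = ((("A", "B"), ("C", "D")), "E")  # assumed topology
for step, (left, right) in enumerate(merge_order(guide_tree), start=1):
    print(step, left, "+", right)
```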

Page 45: Pairwise and Multiple Sequence Alignment Lesson 2

Main disadvantages of progressive alignments

[Figure: guide tree over A, B, C, D, E next to the rows Sequence A to Sequence E]

Guide-tree topology may be considerably wrong

Globally aligning pairs of sequences may create errors that will propagate through to the final result

Page 46: Pairwise and Multiple Sequence Alignment Lesson 2

Iterative alignment

[Figure: cycle linking the sequences (A to E), the pairwise distance table, the guide tree and the MSA]

Iterate until the MSA does not change (convergence)

Page 47: Pairwise and Multiple Sequence Alignment Lesson 2

Blastp – acquiring sequences

Page 48: Pairwise and Multiple Sequence Alignment Lesson 2

blastp – acquiring sequences

Page 49: Pairwise and Multiple Sequence Alignment Lesson 2

blastp – acquiring sequences

Page 50: Pairwise and Multiple Sequence Alignment Lesson 2

MSA input: multiple sequence Fasta file

>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH

>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH

>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH

>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH

>gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]
MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
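A multiple-sequence Fasta file has a simple structure: each record starts with a header line beginning with `>`, followed by one or more sequence lines. The sketch below is a minimal reader written for illustration, using only the standard library; the function name `parse_fasta` and the shortened example records are assumptions made here.

```python
def parse_fasta(text: str) -> dict:
    """Map each header (without the leading '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines between records
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>delta globin
MVHLTPEEKTAVNALWGKVNVDAV
GGEALGRLLVVYPWTQRFFE

>beta globin
MVHLTPEEKSAVTALWGKVNVDEV
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```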

Page 51: Pairwise and Multiple Sequence Alignment Lesson 2

MSA using ClustalX

Page 52: Pairwise and Multiple Sequence Alignment Lesson 2

Step 1: Load the sequences

Page 53: Pairwise and Multiple Sequence Alignment Lesson 2

A little unclear…

Page 54: Pairwise and Multiple Sequence Alignment Lesson 2

Edit Fasta headers…

Original headers:

>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
>gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]

Edited headers:

> delta globin
> beta globin
> epsilon globin
> G-gamma globin
> A-gamma globin
> hemoglobin zeta
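Editing the headers by hand works for six sequences; for larger files the same shortening can be scripted. The sketch below is one possible approach, assuming headers of the form shown above (identifiers separated by `|`, then a description, then the species in square brackets); the function name `short_header` is introduced here for illustration.

```python
import re


def short_header(header: str) -> str:
    """Keep only the description: drop identifiers, species tag and commas."""
    description = header.rsplit("|", 1)[-1]            # text after the last '|'
    description = re.sub(r"\[.*?\]", "", description)  # remove '[Homo sapiens]'
    return description.replace(",", "").strip()


headers = [
    ">gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]",
    ">gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]",
    ">gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]",
    ">gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]",
    ">gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]",
    ">gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]",
]

for header in headers:
    print("> " + short_header(header))
```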

Page 55: Pairwise and Multiple Sequence Alignment Lesson 2

Step 2: Perform alignment

Page 56: Pairwise and Multiple Sequence Alignment Lesson 2

MSA and conservation view

Page 57: Pairwise and Multiple Sequence Alignment Lesson 2
Page 58: Pairwise and Multiple Sequence Alignment Lesson 2

Messing-up alignment of HIV-1 env

Page 59: Pairwise and Multiple Sequence Alignment Lesson 2

MSA tools

Progressive: CLUSTALX/CLUSTALW

Iterative: MUSCLE, MAFFT, PRANK