August 26, 2011 Biochemistry 201 David Worthylake, 7152 MEB, x5176 Sequence Alignments and Database Searching

Page 1: August 26, 2011 Biochemistry 201 David Worthylake, 7152 MEB, x5176

August 26, 2011

Biochemistry 201

David Worthylake, 7152 MEB, x5176

Sequence Alignments and Database Searching


Why compare protein sequences?

Significant sequence similarities allow associations based upon known functions.

Protein A of interest to you.

ornithine decarboxylase?


Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia

Homology vs. similarity

1) Homologous proteins (i.e. having similar structures) need not possess high sequence identity/similarity: S. griseus trypsin 36%, S. griseus protease A 25%.

It is possible for proteins to possess high sequence identity/similarity between segments and not be homologous.

2) Cytochrome c4 has reasonably high sequence identity/similarity with trypsins, yet has neither a common ancestor nor a common fold.

3) Subtilisin has the same spatial arrangement of active site residues, but is not related to trypsins.


Homology vs. similarity

Homologous proteins always share a common three-dimensional fold, often with a common active or binding site.

Proteins that share a common ancestor are homologous.

Proteins that possess >25% identity across their entire length generally will be homologous (but there can be exceptions).

Proteins with <20% identity may still be homologous; low identity does not rule out homology.


Homology vs. similarity

Homologous sequences are either: 1) orthologous, or 2) paralogous

Orthologs - sequence differences arise from divergence in different species (e.g. cytochrome c)
Paralogs - sequence differences arise after gene duplication within a given species (e.g. GPCRs, hemoglobins)

• For orthologs - sequence divergence and evolutionary relationships will agree.
• For paralogs - no necessary linkage between sequence divergence and speciation.

Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia

Orthologous cytochrome c isozymes

Hemoglobins contain both orthologs and paralogs


We’ve all seen and/or used sequence alignments, but how are they accomplished?

Sequence searches and alignments using DNA/RNA are usually not as informative as searches and alignments using protein sequences. However, DNA/RNA searches are intuitively easier to understand:

AGGCTTAGCAAA........TCAGGGCCTAATGCG
|||||||| |||        ||||||||||| |||
AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG

The above alignment could be scored by giving a “1” for each identical nucleotide, a zero for a mismatch, a -4 for opening a gap and a -1 for each extension of the gap. There are 25 identities, and the 8-position gap costs 4 + 7 = 11 (one opening plus seven extensions). So score = 25 – 11 = 14
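The scoring rule just described can be checked with a few lines of Python. This is a minimal sketch, not a tool from the lecture: the function name `score_alignment` and its parameter names are made up for illustration, and it assumes that the first column of a gap pays the opening penalty and each further column pays the extension penalty.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-4, gap_extend=-1):
    """Score an existing pairwise alignment; '.' or '-' marks a gap column."""
    assert len(top) == len(bottom), "aligned strings must have equal length"
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a in ".-" or b in ".-":
            # First column of a gap pays the opening penalty,
            # every following column pays the extension penalty.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "AGGCTTAGCAAA........TCAGGGCCTAATGCG"
bottom = "AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG"
print(score_alignment(top, bottom))  # 25 identities - 11 gap cost = 14
```

Changing `gap_open` or `gap_extend` shows how strongly the final score depends on the gap model rather than on the identities alone.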


Protein sequence alignments are much more complicated. How would this alignment be scored?

ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
|      | | |     |         |||  | | ||   |||
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH

Unlike nucleotide sequence alignments, which are either identical or not identical at a given position, protein sequence alignments include “shades of grey” where one might acknowledge that a T is sort of equivalent to an S etc. But how equivalent? What number would you assign to an S-T mismatch? And what about gaps? Since alanine is a common amino acid, couldn’t the A-A match be by chance? Since Trp and Cys are uncommon, should those matches be given higher scores?

Do you see that accurately aligning sequences and accurately finding related sequences are the same problem?


Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia

Global versus local alignments

Global scores require alignment of the entire sequence length. They cannot be used to detect relationships between domains in mosaic proteins. (Example: Needleman-Wunsch)

Local alignments are necessary to detect domains within mosaic proteins and internal duplications. (Example: BLAST)
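The difference between global and local scoring can be seen with a small dynamic-programming sketch. It is illustrative only: the function `align_score`, the match/mismatch/gap values, and the toy "mosaic" sequence are assumptions, and a simple linear gap penalty is used instead of the open/extend scheme discussed elsewhere in the lecture. With `local=False` it follows the Needleman-Wunsch recurrence; with `local=True` it follows the Smith-Waterman recurrence (the rigorous local method, not the BLAST heuristic).

```python
def align_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming (linear gap penalty).

    local=False: Needleman-Wunsch (global); local=True: Smith-Waterman (local).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            H[i][j] = cell
    return best if local else H[-1][-1]


domain = "HEAGAWGHEE"                              # toy "domain"
mosaic = "PPPPPPPPPP" + domain + "KKKKKKKKKK"      # toy mosaic protein
print(align_score(domain, mosaic, local=True))     # 10: the domain is found
print(align_score(domain, mosaic, local=False))    # -30: flanking regions are penalized
```

The local score reflects only the shared domain, while the global score is dragged down by the 20 unmatched flanking residues.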


Databases

Nucleotide: GenBank (NCBI), EMBL, DDBJ (Japan)

Protein: SwissProt, TrEMBL, GenPept(GenBank)

Huge databases – share much information. Many entries linked to other databases (e.g. PDB). SwissProt small but well “curated”. NCBI non-redundant(nr) protein sequence database is very large but sometimes confusing.

These databases can be searched in a number of ways. Can search only human or metazoan sequences. Can eliminate entries made before a given date, etc.


We’ve got lots of sequences, now how do we score/search? First, we need a way to assign numbers to “shades of grey” matches.

Genetic code scoring system – This assumes that changes in protein sequence arise from mutations. If only one point mutation is needed to change a given AA to another (at a specific position in the alignment), the two amino acids are more closely related than if two point mutations were required.

Physicochemical scoring system – a Thr is like a Ser, a Trp is not like an Ala……

These systems are seldom used because they have problems. Why try to second guess Nature? Since there are many related sequences out there, we can look at some (trusted) alignments to SEE which substitutions have occurred and the frequency with which they occur.
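For the genetic code scoring system mentioned above, the "number of point mutations needed" can be computed directly from the standard genetic code. The sketch below is an illustration, not part of the lecture: the helper name `min_point_mutations` is hypothetical, and the codon table is the standard code written in the usual TCAG order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (one-letter code, '*' = stop) to its list of codons.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(codon))


def min_point_mutations(aa1, aa2):
    """Smallest number of nucleotide changes turning a codon for aa1 into one for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


print(min_point_mutations("S", "T"))  # Ser -> Thr needs a single change
print(min_point_mutations("W", "D"))  # Trp -> Asp needs all three positions changed
```

One weakness of this scheme is visible immediately: the result can only be 0, 1, 2 or 3, so it cannot express the finer "shades of grey" that observed substitution frequencies provide.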


PAM (Point Accepted Mutation) matrices

• Are derived from studying global alignments of well-characterized protein families.
• PAM1 = only 1% of residues have changed (i.e. short evolutionary distance).
• Raise this to the 250th power to get PAM250: 250 accepted changes per 100 residues between two sequences (greater evolutionary distance), which corresponds to about 20% sequence identity.
• Therefore, a PAM 30 would be used to analyze more closely related proteins; a PAM 400 is used for finding and analyzing very distantly related proteins.
• $\mathrm{PAM}x = (\mathrm{PAM}1)^{x}$

(Dayhoff, Atlas of Protein Sequence and Structure, vol. 5, suppl 3, p 345-352)
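The idea that PAMx is PAM1 multiplied by itself x times can be tried out on a toy mutation matrix. The sketch below is an assumption-laden illustration, not the Dayhoff data: `toy_pam1` treats all 20 amino acids identically (99% chance of staying unchanged, the remaining 1% spread evenly over the other 19), and the helpers `mat_mul` and `mat_pow` are written out in plain Python.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    while power:
        if power & 1:
            result = mat_mul(result, M)
        M = mat_mul(M, M)
        power >>= 1
    return result


N = 20  # number of amino acids
# Toy "PAM1": NOT the real Dayhoff matrix, every residue behaves the same.
toy_pam1 = [[0.99 if i == j else 0.01 / (N - 1) for j in range(N)]
            for i in range(N)]
toy_pam250 = mat_pow(toy_pam1, 250)

# Probability that a residue is unchanged after 250 PAM steps in this toy model.
print(round(toy_pam250[0][0], 3))
```

In this uniform toy model roughly 12% of positions are still identical after 250 steps even though every position has mutated 2.5 times on average, because positions can mutate repeatedly or revert. The real matrices retain more identity because observed substitutions are far from uniform.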


Block substitution matrices (BLOSUM)

Are derived from studying local alignments (blocks) of sequences from related proteins that share no more than X% identity. (Henikoff & Henikoff, PNAS ‘92, 89, p10915-10919)

1) In other words, one might use the portions of aligned sequences from related proteins that have no more than 62% identity (in the portions or blocks) to derive the BLOSUM 62 scoring matrix.

2) One might use only the blocks that have <80% identity to derive the BLOSUM 80 matrix.

3) BLOSUM and PAM substitution matrices are numbered in opposite directions:

a) The higher the number of the BLOSUM matrix (BLOSUM X), the more closely related proteins you are looking for.

b) The higher the number of the PAM matrix (PAM X), the more distantly related proteins you are looking for.


Amino acid substitution matrices

Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia

PAM250 matrix

Note that for identical matches, scores vary depending upon observed frequencies. That is, rare amino acids (e.g. Trp) that are not substituted have high scores; frequently occurring amino acids (e.g. Ala) are down-weighted because of the high probability of aligning by chance.

• Negative scores - unlikely substitutions


Gap penalties – Intuitively one recognizes that there should be a penalty for introducing (requiring) a gap during identification/alignment of a given sequence. But if two sequences are related, the gaps may well be located in loop regions, which are more tolerant of mutational events and probably have little impact on structure. Therefore, a new gap should be penalized, but extending an existing gap should be penalized very little.

Filtering – many proteins and nucleotide sequences contain simple repeats or regions of low sequence complexity. These must be excluded from searches and alignments. Why?

Significance of a “hit” during a search – More important than an arbitrary score is an estimation of the likelihood of finding a hit through pure chance. Ergo the “Expectation value” or E-value. E-values can be as low as 0 for an identical (long) match (e.g. a 250 AA protein finding itself in a search).


E-value

So, for sufficiently large databases (so one can apply statistics):

$E = K m n e^{-\lambda S}$

m - query length
n - database length
E - expectation value
K - scale factor for query sequence (AA composition)
λ - scale factor for scoring system (e.g. PAM250)
S - score, dependent on substitution matrix, gap penalties, etc.

Doubling either m or n doubles the expectation value for a given score; similarly, as the score increases, the expectation value decreases exponentially.

Expectation value - the number of hits with at least the given score that are expected to occur by chance, given the query AND database “strings” (for small values it is close to the probability of such a chance hit).
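The behaviour of the E-value formula is easy to explore numerically. The sketch below simply evaluates E = K m n e^(-λS); the function name `expectation_value` is hypothetical, and the default K and λ are illustrative placeholder values, not parameters quoted in the lecture.

```python
import math


def expectation_value(score, m, n, K=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S).

    K and lam are illustrative placeholders; real values depend on the
    scoring matrix and gap penalties used in the search.
    """
    return K * m * n * math.exp(-lam * score)


m = 250          # query length (residues)
n = 5_000_000    # total database length (residues), made up for the example

for s in (40, 80, 160):
    print(s, expectation_value(s, m, n))

# Doubling the database length doubles E for the same score.
print(expectation_value(80, m, 2 * n) / expectation_value(80, m, n))
```

The loop shows the exponential drop in E with increasing score, while the last line shows the merely linear dependence on database (or query) length.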


Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia

Removing length bias from scoring statistics

• Must account for increases in similarity score due to increase in sequence length searched.

• Scaling the sequence length allows the detection of distantly-related sequences.

Figure legend:
• solids = individual sequences
• opens = average score



Basic local alignment search tool (BLAST)

1) Break query up into “words”, e.g. ASTGHKDLLV gives the WORDS AST, STG, TGH, ...

2) Generate an expanded list of words that would match with (e.g. PAM250) a score of at least T. You’re acknowledging that you may not have any exact matches with the original list of words.

3) Use expanded list of words to search database for exact matches.

4) Extend alignments from where word(s) found exact match.

Heuristic algorithm – Uses guesses. Increases speed without a great loss of accuracy. BLASTP and FASTA are local and heuristic; Smith-Waterman (S-W) is local and rigorous; Needleman-Wunsch is global and rigorous.
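Steps 1 and 3 of the word-based search can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: the function names `words` and `find_seeds` and the database sequence `subject` are made up, and step 2 (expanding each word into its neighborhood using a substitution matrix and threshold T) is deliberately left out, so only exact word matches are found.

```python
def words(sequence, w=3):
    """Step 1: all overlapping words of length w."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def find_seeds(query, subject, w=3):
    """Step 3 (exact words only): (query_pos, subject_pos) pairs sharing a word.

    Real BLASTp would first expand each query word into a list of similar
    "neighborhood" words (step 2); that expansion is omitted here.
    """
    index = {}
    for pos, word in enumerate(words(subject, w)):
        index.setdefault(word, []).append(pos)
    return [(qpos, spos)
            for qpos, word in enumerate(words(query, w))
            for spos in index.get(word, [])]


query = "ASTGHKDLLV"
subject = "MKTGHKDWLLV"   # hypothetical database sequence

print(words(query))
print(find_seeds(query, subject))
```

Seeds that fall on the same diagonal (equal difference between subject and query position) are the natural starting points for the extension in step 4.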


Pictorial representation of BLASTp algorithm (Basic Local Alignment Search Tool proteins).

Query sequence

Words (they overlap)

Expand list of words (each word (left) has “similar” words)

Search database, find hits, extend alignments

Report sorted list of hits


NCBI BLAST

Nucleotide BLAST looks for exact matches (one hit): exact word match CATGCTTAATT within ATCGCCATGCTTAATTGGGCTT.

Protein BLAST (BLASTp) requires two hits: neighborhood words SQI and YYN for GTQITVEDLFYNI.


FASTA

Instead of breaking up the query into words (and then generating a list of similar words), find all sequences in the database that contain short sequences that are exact or nearly exact matches for sequences within the query. Score these and sort. Sort of a reverse methodology to BLAST.

[Figure axes: Query sequence vs. Database sequence]


Protein database


mouse over


link to entrez

sorted by e values

5 x e-98

Gene

$S' = \frac{\lambda S - \ln K}{\ln 2}$

$E = m n \, 2^{-S'} = K m n e^{-\lambda S}$
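The two forms of the E-value, one written with the bit score S' and one with the raw score S, can be checked against each other numerically. The sketch below uses hypothetical function names, and the K, λ, m and n values are arbitrary illustrative numbers.

```python
import math


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


def e_from_raw(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)"""
    return K * m * n * math.exp(-lam * raw_score)


# Arbitrary illustrative parameters (not taken from the lecture).
K, lam, m, n, S = 0.041, 0.267, 250, 5_000_000, 120

bits = bit_score(S, K, lam)
print(bits, e_from_bits(bits, m, n), e_from_raw(S, m, n, K, lam))
```

Because K and λ are folded into S', bit scores from searches run with different scoring systems can be compared directly, which is not true of raw scores.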


Identifying distant homologies (use several different query sequences)

Examine output carefully. A lack of statistical significance doesn’t necessarily mean a lack of homology!

Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia

Also remember - If A is homologous to B, and B to C, then A should be homologous to C


PSI-BLASTp

Very sensitive, but must not include a non-member sequence!

1) Regular BLASTp search.

2) Sequences above a certain threshold (< specified E-value) are included. Assumed to be related proteins. This group of sequences is used to define a “profile” that contains the sequence “essence” of the protein family.

3) Now with the important sequence positions highlighted, can look for more distantly related sequences that should still have the “essence” of the protein family.

4) Inclusion of more distantly related sequences modifies the profile further (further defines the essence) and allows for identification of even more distantly related sequences. Etc.

Note: PSI-BLASTp may find and then subsequently lose a homologous sequence during the iteration process! “Drifting” of the program would be the gradual loss of distant homologs during the iteration process.


>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYYVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNGRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY

e value cutoff for inclusion

PSI-BLAST: initial run (NCBI)


PSI-BLAST: initial run (NCBI)


Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

PSI-BLAST: first PSSM search

Note: These E-values are different from usual BLASTp because of the position-specific scoring matrix (later).


iteration 1

iteration 2

PSI-BLAST of human Tiam1

PSI-BLAST: importance of original query (remember, if A is like B….)


iteration 2

iteration 1

iteration 3

Ras-binding domains

PSI-BLAST of mouse Tiam2 (~90% identity with human Tiam1)

PSI-BLAST: importance of original query


Position specific scoring matrix (PSSM) (learning from your “hits”) (NCBI)

Active site serine

Weakly conserved serine


Position specific scoring matrix (PSSM)

Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Serine scored differently in these two positions

Active site nucleophile
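Scoring a sequence against a PSSM means looking up each residue in the row for its own position rather than in one shared substitution matrix. The sketch below copies a few rows (positions 211 and 214-219) from the matrix on this slide; the function name `pssm_score` and the test peptides are made up for illustration.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order of the matrix

# Selected rows copied from the slide's PSSM (position -> scores per column).
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}


def pssm_score(peptide, start):
    """Sum of position-specific scores for a peptide aligned without gaps at `start`."""
    return sum(PSSM[start + i][COLUMNS.index(res)] for i, res in enumerate(peptide))


print(pssm_score("GDSGGP", 214))  # matches the profile at every position
print(pssm_score("GDAGGP", 214))  # Ser -> Ala at position 216 is penalized
print(pssm_score("S", 211), pssm_score("S", 216))  # same residue, different scores
```

The last line shows the point of the slide: a serine scores 4 at position 211 but 7 at position 216, something a single position-independent matrix cannot express.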


Multiple sequence alignments (MSAs)

In this example, an MSA is used to identify regions of high sequence conservation presumably reflecting structural and functional constraints. Useful for delimiting known domains and potential new functional regions (e.g. the Ras-binding domain in yellow and the blue box of currently unknown function).


Fun with MSA...

MSA used to locate functional residues and domain boundaries in homologs of Dbl-proteins with known structure (Dbs and Tiam1).

Red amino acids directly interact with GTPases. Blue residues directly interact with phosphoinositides.


Phyre uses a 3-dimensional Position Specific Scoring Matrix!


Hidden Markov Models – devices for generating folds

An HMM is created using some examples and general rules.

The examples are defined folds.

For instance, 60 PH domains might be used to create an HMM for PH domains.

An HMM can assign a probability that it generated a given sequence (e.g. does this sequence represent a PH domain?)


A very simple HMM for a protein with 4 amino acids

The square boxes are called “match states” – these will emit an amino acid with a set probability for each AA. Diamond boxes are for insertions between match states, and the circles are for deletions of match states.

There are probabilities associated with all of the arrows. There are many possible paths through the Model! These are the “rules” learned from the examples (e.g. PH domains you used).

Random transitions through the Model and emissions from the states are guided by probabilities. All you see at the end is the generated sequence. The model that generated the sequence is “hidden”. But the resulting sequence is related to those sequences used to construct the model. Again, IT IS POSSIBLE TO CALCULATE THE PROBABILITY THAT A GIVEN SEQUENCE WAS GENERATED BY THE MODEL!
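How such a probability is calculated can be shown with the forward algorithm on a deliberately tiny model. Everything in this sketch is an assumption made for illustration: it is a generic two-state HMM (hypothetical states "core" and "loop" emitting hydrophobic "H" or polar "P" symbols), not the match/insert/delete profile architecture drawn on the slide, and all probabilities are invented.

```python
from itertools import product

STATES = ("core", "loop")                      # hypothetical hidden states
START = {"core": 0.6, "loop": 0.4}             # invented probabilities throughout
TRANS = {"core": {"core": 0.8, "loop": 0.2},
         "loop": {"core": 0.3, "loop": 0.7}}
EMIT = {"core": {"H": 0.9, "P": 0.1},
        "loop": {"H": 0.2, "P": 0.8}}


def forward_probability(seq):
    """P(seq | model), summed over every hidden path (forward algorithm)."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        f = {s: EMIT[s][symbol] * sum(f[p] * TRANS[p][s] for p in STATES)
             for s in STATES}
    return sum(f.values())


def brute_force_probability(seq):
    """Same quantity by explicitly enumerating all hidden paths (slow check)."""
    total = 0.0
    for path in product(STATES, repeat=len(seq)):
        p = START[path[0]] * EMIT[path[0]][seq[0]]
        for prev, cur, symbol in zip(path, path[1:], seq[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][symbol]
        total += p
    return total


print(forward_probability("HHPH"))
print(brute_force_probability("HHPH"))
```

The two functions agree, but the forward algorithm needs work proportional to the sequence length, whereas enumerating the "many possible paths" grows exponentially. A profile HMM applies the same idea to match, insert and delete states.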


What you should know

Homology: If two proteins are homologous, they have a common fold and a common ancestor.

If two proteins have >25% identity across their entire length, they are likely to be homologs. However, sometimes true homologs have quite low sequence identity!

Orthologs: Homologous (and equivalent) proteins from different species. Arise from speciation.

Paralogs: Homologous (and equivalent) proteins found in the same species. Divergence of sequences NOT from speciation (gene duplication).

Alignments: How to score? Minimum # of mutations? Physicochemical properties (as perceived by us)? Or learn from nature?

Scoring schemes: PAM, BLOSUM

BLAST vs. PSI-BLAST

Algorithms: BLASTp, FASTA, Smith-Waterman, Needleman-Wunsch


E values: What it means in words

$E = K m n e^{-\lambda S}$

Alignment algorithms:
BLAST (local, heuristic)
FASTA (local, heuristic)
Smith-Waterman (local, rigorous)
Needleman-Wunsch (global, rigorous)

Why use local alignment algorithm?

Why use global alignment?