
Identification of Protein Domains

Eden Dror, Menachem Schechter

Computational Biology Seminar 2004


Overview

• Introduction to protein domains.
  – Classification of homologs.

• Representing a domain.
  – PSSM
  – HMM

• Internet resources
  – Pfam
  – SMART
  – PROSITE
  – InterPro

• Research example.


Protein domains

• A discrete portion of a protein assumed to fold independently, and possessing its own function.

• Mobile domain (“module”): a domain that can be found associated with different domain combinations in different proteins.


Protein domains

• The assumption: The domain is the fundamental unit of protein structure and function.

• Protein family – all proteins containing a specific domain.


What can we learn from them?

• Common ancestors & homology information of a set of proteins.

• Homology can imply properties of a protein, such as functionality & localization.

• Therefore, domains can be used to assign a new protein to a family, and so infer its functionality.


Classification of homologs

• Homology is not a sufficiently well-defined term to describe the evolutionary relationships between genes.

• Homologous genes can arise in two major ways:
  – Gene duplication (within the same species).
  – Speciation (splitting of one species into two).


Classification of homologs


Classification of homologs

• Orthologs – Two genes from two different species that derive from a single gene in the last common ancestor of the species.

• Paralogs – Two genes that derive from a single gene that was duplicated within a genome.


Classification of homologs

[Figure: gene relationships labelled "para" (paralogs) and "ortho" (orthologs).]


Classification of homologs

• Inparalogs - paralogs that evolved by gene duplication after the speciation event.

• Outparalogs - paralogs that evolved by gene duplication before the speciation event.


Classification of homologs

[Figure: genes labelled as out-paralogs and in-paralogs when comparing human with worm.]


What can we learn from them?

• Orthologous proteins are evolutionary, and typically functional, counterparts in different species.

• Paralogous proteins are important for detecting lineage-specific adaptations.

• Both of them can reveal information on a specific species or a set of species.


Protein domains – summary

• By identifying domains we can:

  – Infer the functionality & localization of a protein.
  – Learn about a specific species.
  – Learn about a set of species as a group.


Domain representation

• Different methods to represent (model) domains:

  – Patterns (regular expressions).
  – PSSM (position-specific score matrix).
  – HMM (hidden Markov model).


PSSM

• Position specific score matrix

• Score matrix representing the score for having each amino acid in a given position in a specific sequence.

• Based on the independent probabilities P(a|i) of observing amino acid a in position i.
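
A minimal sketch of how such a matrix could be derived from an ungapped alignment. The toy alignment, the uniform background frequency of 1/20 and the pseudocount of 1 are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0, background=None):
    """Return one {amino_acid: log-odds score} dict per alignment column."""
    if background is None:
        # Assumption: every amino acid is equally likely under the random model.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("alignment must be ungapped and of equal length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        column = {}
        for a in AMINO_ACIDS:
            p = (counts[a] + pseudocount) / total   # estimate of P(a|i)
            column[a] = math.log2(p / background[a])
        pssm.append(column)
    return pssm

# Hypothetical four-sequence alignment.
toy_alignment = ["ACDK", "ACEK", "GCDK", "ACDR"]
pssm = build_pssm(toy_alignment)
```

Here the fully conserved C in the second column receives a positive score, while residues never seen in that column score below zero.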


PSSM: Example


PSSM: Identifying a domain

• Given a sequence and a PSSM:

  – Run over all positions.
  – Score each sub-sequence according to the matrix.
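
The scan can be sketched as a sliding window. The three-column matrix, the default score for unlisted residues and the example sequence below are all hypothetical; a real search would also compare each score against a threshold.

```python
def scan(sequence, pssm, default=-1.0):
    """Score every window of len(pssm) residues; return (start, score) pairs."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        hits.append((start, score))
    return hits

# Hypothetical matrix: one dict per position, unlisted residues get `default`.
toy_pssm = [
    {"A": 2.0, "G": 1.0},
    {"C": 3.0},
    {"D": 2.0, "E": 1.0},
]

hits = scan("MKACDLGCEW", toy_pssm)
best_start, best_score = max(hits, key=lambda h: h[1])
```

The highest-scoring window marks the most likely location of the domain in the sequence.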


HMM: Hidden Markov Model

• Markov model: a way of describing a process that goes through a series of states.

• Each state has a probability of transitioning to the other states.

• x_i is a random variable representing the state at step i.

[Diagram: chain of states x1 → x2 → x3 → x4]


HMM: Markov Model

• Example:
  – States are {0,1}

x1=0 x2=0 x3=0 x4=0

x1=1 x2=0 x3=0 x4=1

x1=0 x2=1 x3=1 x4=1


HMM: Markov Model

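• Transition matrix:

$$A = (a_{ij}) = \begin{pmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{pmatrix}, \qquad a_{ij} = P(x_{k+1} = j \mid x_k = i)$$

[Diagram: chain of states x1 → x2 → x3 → x4]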
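
A small sketch of what a transition matrix lets us compute: the probability of a state sequence, and random sequences drawn from the chain. It assumes the slide's matrix reads A = [[0.6, 0.4], [0.2, 0.8]] with rows indexed by the current state; the uniform initial distribution is also an assumption, since the slides give none.

```python
import random

STATES = [0, 1]
# Assumed reading of the slide: A[i][j] = P(x_{k+1} = j | x_k = i).
A = [[0.6, 0.4],
     [0.2, 0.8]]

def sequence_probability(states, initial):
    """P(x_1..x_n) = P(x_1) * product over k of A[x_k][x_{k+1}]."""
    p = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

def sample_chain(n, initial, rng):
    """Draw a state sequence of length n from the chain."""
    x = rng.choices(STATES, weights=initial)[0]
    chain = [x]
    for _ in range(n - 1):
        x = rng.choices(STATES, weights=A[x])[0]
        chain.append(x)
    return chain

initial = [0.5, 0.5]  # assumption: not given in the slides
p = sequence_probability([0, 1, 1, 1], initial)  # 0.5 * 0.4 * 0.8 * 0.8
chain = sample_chain(10, initial, random.Random(0))
```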


HMM: Markov Model

• State transition example:
  – States are the nucleotides A, T, G, C.


HMM: Hidden Markov Model

• Hidden Markov model:
  – Each state x emits an output y, with a specific probability.
  – We only know the output (observations).
  – Thus, the states are hidden.

[Diagram: each hidden state x1 … x4 emits the corresponding output y1 … y4]


HMM: Hidden Markov Model

• Example: states are {0,1}, output {0,1}

y1 =1 y2 =1 y3 =0 y4 =0

x1 =0 x2 =1 x3 =1 x4 =1

y1 =1 y2 =0 y3 =1 y4 =0

x1 =1 x2 =0 x3 =0 x4 =1


HMM: Hidden Markov Model

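• Emission matrix:

$$B = (b_{ij}) = \begin{pmatrix} 0.1 & 0.9 \\ 0.85 & 0.15 \end{pmatrix}, \qquad b_{ij} = P(y_k = j \mid x_k = i)$$

[Diagram: hidden states x1 … x4 emitting outputs y1 … y4]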


HMM: What can we do with it?

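• Given (A, B):

• Probability of given states and outputs:

$$P(x_1 x_2 \ldots x_n, y_1 y_2 \ldots y_n) = P(x_1)\,P(y_1 \mid x_1)\,P(x_2 \mid x_1)\,P(y_2 \mid x_2) \cdots$$

• Probability of a given output sequence:

$$P(y_1 y_2 \ldots y_n) = \sum_{x_1 \ldots x_n} P(x_1 x_2 \ldots x_n, y_1 y_2 \ldots y_n)$$

• Most likely sequence of states that generated a given output sequence:

$$\max_{x_1 \ldots x_n} P(x_1 x_2 \ldots x_n \mid y_1 y_2 \ldots y_n)$$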
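
The three quantities can be sketched for a two-state model as follows. The numbers in A and B are an assumed reading of the matrices on the earlier slides, and the uniform initial distribution PI is an assumption. The sum over all state paths is computed with the forward recursion and the maximisation with the Viterbi recursion, the standard dynamic-programming methods for these tasks, which the slides do not name.

```python
A = [[0.6, 0.4], [0.2, 0.8]]      # A[i][j] = P(x_{k+1}=j | x_k=i), assumed values
B = [[0.1, 0.9], [0.85, 0.15]]    # B[i][j] = P(y_k=j | x_k=i), assumed values
PI = [0.5, 0.5]                   # assumption: uniform initial distribution
N_STATES = 2

def joint(xs, ys):
    """P(x_1..x_n, y_1..y_n) for a given state path and output sequence."""
    p = PI[xs[0]] * B[xs[0]][ys[0]]
    for k in range(1, len(xs)):
        p *= A[xs[k - 1]][xs[k]] * B[xs[k]][ys[k]]
    return p

def forward(ys):
    """P(y_1..y_n): sums over all state paths without enumerating them."""
    f = [PI[i] * B[i][ys[0]] for i in range(N_STATES)]
    for y in ys[1:]:
        f = [sum(f[i] * A[i][j] for i in range(N_STATES)) * B[j][y]
             for j in range(N_STATES)]
    return sum(f)

def viterbi(ys):
    """Most probable state path for the outputs ys."""
    v = [PI[i] * B[i][ys[0]] for i in range(N_STATES)]
    back = []
    for y in ys[1:]:
        pointers, new_v = [], []
        for j in range(N_STATES):
            best = max(range(N_STATES), key=lambda i: v[i] * A[i][j])
            pointers.append(best)
            new_v.append(v[best] * A[best][j] * B[j][y])
        back.append(pointers)
        v = new_v
    state = max(range(N_STATES), key=lambda i: v[i])
    path = [state]
    for pointers in reversed(back):   # follow back-pointers to the start
        state = pointers[state]
        path.append(state)
    return path[::-1]

outputs = [1, 1, 0, 0]
likelihood = forward(outputs)
best_path = viterbi(outputs)
```

Both recursions take time proportional to the sequence length, whereas direct enumeration of state paths grows exponentially.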


HMM: What can we do with it?

• Learning:

• Given state and output sequences calculate the most probable (A, B).

• Easy when the states are known.

• Otherwise: use a training algorithm.
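
When the states are known, the most probable (A, B) can be obtained by counting. The sketch below assumes two states and two outputs and reuses the example sequences from the earlier slides; the function name and the optional pseudocount are illustrative.

```python
from collections import Counter

def estimate(state_seqs, output_seqs, n_states=2, n_outputs=2, pseudocount=0.0):
    """Estimate (A, B) as normalised transition and emission counts."""
    trans, emit = Counter(), Counter()
    for xs, ys in zip(state_seqs, output_seqs):
        for a, b in zip(xs, xs[1:]):
            trans[(a, b)] += 1          # count transitions a -> b
        for x, y in zip(xs, ys):
            emit[(x, y)] += 1           # count emissions of y from state x

    def normalise(counts, n_cols):
        rows = []
        for i in range(n_states):
            row = [counts[(i, j)] + pseudocount for j in range(n_cols)]
            total = sum(row)
            rows.append([c / total if total else 0.0 for c in row])
        return rows

    return normalise(trans, n_states), normalise(emit, n_outputs)

states = [[0, 1, 1, 1], [1, 0, 0, 1]]
outputs = [[1, 1, 0, 0], [1, 0, 1, 0]]
A_hat, B_hat = estimate(states, outputs)
```

When the states are hidden, these counts are not available; the usual training algorithm in that case is Baum-Welch, an expectation-maximisation procedure.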


HMM: Profile HMM

• Use HMM to represent sequence families.

• A particular type of HMM suited to modeling multiple alignments.

• (Assume we have a multiple alignment).


HMM: Trivial profile HMM

• We begin with ungapped regions.

• Each position corresponds to a state.

• Transitions are of probability 1.


HMM: Trivial profile HMM

• Let e_i(a) be the independent probability of observing amino acid a in position i.

• The probability of a new sequence x, according to the model:

$$P(x \mid M) = \prod_{i=1}^{N} e_i(x_i)$$


HMM: Trivial profile HMM

• We can score the sequence x:

$$S = \sum_{i=1}^{N} \log \frac{e_i(x_i)}{q_{x_i}}$$

• Where q indicates the probability under a random model.
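
One way to see where this score comes from, assuming the random model R emits each residue a independently with probability q_a: the score is the log of the ratio between the probability of x under the profile model M and under R.

$$S = \log \frac{P(x \mid M)}{P(x \mid R)} = \log \frac{\prod_{i=1}^{N} e_i(x_i)}{\prod_{i=1}^{N} q_{x_i}} = \sum_{i=1}^{N} \log \frac{e_i(x_i)}{q_{x_i}}$$

A positive S means the sequence is more probable under the domain model than under the random model.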


HMM: Trivial profile HMM

• Consider the values

$$\log \frac{e_i(x_i)}{q_{x_i}}$$

• They behave like elements in a score matrix.

• The trivial profile HMM is equivalent to a PSSM.


HMM: profile HMM

• Let’s untrivialize by allowing for gaps: insertions and deletions.

• Start off with the PSSM HMM.


HMM: profile HMM

• Handling insertions:

• Introduce new states Ij – match insertions after position j.

• These states emit residues with the random-model (background) probabilities.


HMM: profile HMM

• The score of a gap of length k:

$$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$$
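
A short sketch of this score as a function of the gap length k. The three transition probabilities are hypothetical values chosen for illustration: a_mi for M_j → I_j, a_ii for I_j → I_j and a_im for I_j → M_{j+1}.

```python
import math

def insertion_score(k, a_mi, a_ii, a_im):
    """Log-score of an insertion of length k after match state j."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    # One M->I transition, k-1 I->I self-loops, one I->M transition.
    return math.log(a_mi) + (k - 1) * math.log(a_ii) + math.log(a_im)

# Hypothetical transition probabilities.
scores = [insertion_score(k, a_mi=0.05, a_ii=0.4, a_im=0.6) for k in (1, 2, 3)]
```

Each additional inserted residue adds the same amount, log a_ii, so the score behaves like an affine gap penalty: a fixed cost for opening the gap plus a constant cost per extension.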


HMM: profile HMM

• Handling deletions:

• Introduce silent states Dj.

• These states do not emit.


HMM: profile HMM

• The complete profile HMM:


Internet resources

• Databases of protein families.
• Family information and identification.

• Considerations:
  – Type of representation (pattern, PSSM, HMM).
  – Choice of seed multiple alignment proteins.
  – Quality control.
  – Database features (links, annotations, views).
  – Database specificity (organism, functions).


Pfam: Home


Pfam

• Protein families database of alignments and HMMs

• Uses profile-HMMs to represent families.

• For each family in Pfam you can:
  – Look at multiple alignments
  – View protein domain architectures
  – Examine species distribution
  – Follow links to other databases
  – View known protein structures


Pfam: Databases

2 databases:

• Pfam-A – curated multiple alignments.
  – Grows slowly.
  – Quality controlled by experts.

• Pfam-B – automatic clustering (ProDom derived).
  – Complements Pfam-A.
  – New sequences instantly incorporated.
  – Unchecked: false positives, etc.


Pfam: Features

• Search by: Sequence, keyword, domain, taxonomy.

• Browsing by family or genome.

• Evolutionary tree


Pfam: Construction

• Source of seed alignments:
  – Pfam-B families.
  – Published articles.
  – 'Domain hunting' studies.
  – Occasionally, entries from other databases (e.g. MEROPS for peptidases).


Pfam: Domain information


Pfam: Domain organization


Pfam: Multiple alignment


Pfam: HMM logo


Pfam: Species distribution


Pfam: Genome comparison


PROSITE

• Database of protein families.

• Matching according to simple patterns or PSSM profiles.

• Browsing all proteins of a specific family.

• Latest release knows 1696 protein families.
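
Matching by a simple pattern can be sketched by translating the pattern into a regular expression. The converter below handles only a subset of the PROSITE pattern syntax (residue letters, x, [..], {..} and repeat counts). The example pattern is written in the style of the C2H2 zinc finger pattern but is quoted from memory, so it should be treated as illustrative and checked against the PROSITE entry. The test sequence is made up.

```python
import re

def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split "x(2,4)" into core "x" and repeat bounds "2", "4".
        core, lo, hi = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                          # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # {..} lists forbidden residues
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)

# Illustrative pattern; verify against PROSITE before relying on it.
c2h2_like = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = re.compile(prosite_to_regex(c2h2_like))

# Made-up sequence containing one stretch that fits the pattern.
match = regex.search("MKCAACAAALAAAAAAAAHAAAHGG")
```

Patterns are quick to match but rigid; this is the usual motivation for moving to PSSM profiles or HMMs for more divergent families.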


PROSITE: Features

• Comprehensive domain documentation.
• All profile matches checked by experts.
• Specificity/sensitivity:
  – Specificity: true-pos/all-pos
  – Sensitivity: true-pos/(true-pos + false-neg)
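
As a quick numeric illustration of the two measures, assuming "all-pos" means every match the profile reports (true plus false positives). The counts below are hypothetical and not taken from PROSITE.

```python
def specificity(true_pos, false_pos):
    """Slide definition: fraction of reported matches that are real members."""
    return true_pos / (true_pos + false_pos)

def sensitivity(true_pos, false_neg):
    """Fraction of real family members that the profile finds."""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts for one profile.
tp, fp, fn = 90, 10, 30
spec = specificity(tp, fp)   # 90 of the 100 reported matches are real
sens = sensitivity(tp, fn)   # 90 of the 120 real members are found
```

Under this reading, the slide's "specificity" is the quantity more often called precision, and its "sensitivity" is recall.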


PROSITE: Example

• Specificity of Zinc finger C2H2 type domain


SMART


SMART

• Simple Modular Architecture Research Tool

• Identification and annotation of genetically mobile domains and the analysis of domain architectures.

• SMART consists of a library of HMMs.

• Knows 665 HMMs to date.


SMART: Features

• Finding proteins containing specific domains, i.e. of the same family

• Function prediction
• Sub-cellular localization
• Binding partners
• Architecture
• Alternative splicing information
• Orthology information


SMART: Domain selection example

Tyrosine kinase (TyrKc) AND Transmembrane region (TRANS)
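
Conceptually, an AND query of this kind is a subset test on each protein's domain architecture. The sketch below runs on a hypothetical in-memory table; the protein names and their domain assignments are invented, and it does not use or represent the SMART interface.

```python
# Hypothetical proteins and domain lists; TyrKc and TRANS are the names on the slide.
architectures = {
    "protein_1": ["TRANS", "TyrKc"],
    "protein_2": ["SH2", "SH3", "TyrKc"],
    "protein_3": ["TRANS", "FN3", "TyrKc"],
    "protein_4": ["TRANS"],
}

def select_all(architectures, required):
    """Return the proteins whose architecture contains every required domain."""
    wanted = set(required)
    return sorted(name for name, domains in architectures.items()
                  if wanted <= set(domains))

receptor_like = select_all(architectures, ["TyrKc", "TRANS"])
```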


InterPro

• InterPro combines 9 other databases such as SMART, Pfam, ProDom and more.

• Queries can use many different methods (as the other databases use different methods).

• However, thresholds are predefined and cannot be changed for those methods.


InterPro

• Provides more results, but can sometimes be redundant.

• Coverage statistics:
  – 93% of Swiss-Prot v42.5: 128540 out of 138922 proteins
  – 81% of TrEMBL v25.5: 819966 out of 1013263 proteins


InterPro: Features

• Searching by Protein/DNA sequences

• Finding domains & homologs

• List of InterPro entries of type:
  – Family
  – Domain
  – Repeat
  – PTM (post-translational modification)
  – Binding Site
  – Active Site
  – Keyword


InterPro: Example

• Kringle domain


Research Example: Introduction

• Goal: The systematic identification of novel protein domain families.

• Using computational methods.


Research Example: Method

Derive set of 107 nuclear domains

extract proteins

Extract unannotated regions

Cluster sequences

Take longest member

PSI-BLAST

Investigate homologous regions

Manual confirmation


Research Example: Results

• 28 New Domains identified:

• 15 domains in diverse contexts, in different species.

• 3 domains species specific.

• 7 domains with weak similarity to previously described domains.

• 3 extension domains.


Predictions of Function

• On the basis of reports in literature and/or occurrence with other identified domains, functional features can be predicted for our novel domain families.

• Examples:
  – Chromatin binding
  – Protein interaction
  – Predicted sub-cellular localization


Predictions of Function: Chromatin-Binding example

• The novel domain CSZ is contained in protein SPT6, which regulates transcription via chromatin structure modification.

• SPT6 has a histone-binding capability, experimentally confirmed.

• Other domains (S1, SH2) in SPT6 are unlikely to bind histones or chromatin.

• Conclusion: CSZ has a predicted histone binding function.


Predictions of Function: Localization example

• Some of the novel domains are only found within proteins from the initial set of nuclear domains.

• This predicts that these domains have a nuclear function.

• The other domains are likely to have roles in both nucleus and cytoplasm.


Conclusion

• Domains are the functional units of proteins.

• Identifying a domain within a new protein may teach us much about it.

• There are several types of models to represent domains.

• These models can also be used to identify the domain they represent.

• Many Internet databases are available to catalogue and identify families.

• Protocol to identify new domains using old ones.


Resources

• Pfam: http://www.sanger.ac.uk/Software/Pfam/

• SMART: http://smart.embl-heidelberg.de/

• PROSITE: http://www.expasy.org/prosite/

• InterPro: http://www.ebi.ac.uk/interpro/


The End