Identification of Protein Domains
Eden Dror, Menachem Schechter
Computational Biology Seminar 2004
Overview
• Introduction to protein domains.
– Classification of homologs.
• Representing a domain.
– PSSM
– HMM
• Internet resources
– Pfam
– SMART
– PROSITE
– InterPro
• Research example.
Protein domains
• A discrete portion of a protein assumed to fold independently, and possessing its own function.
• Mobile domain (“module”): a domain that can be found associated with different domain combinations in different proteins.
Protein domains
• The assumption: The domain is the fundamental unit of protein structure and function.
• Protein family – all proteins containing a specific domain.
What can we learn from them?
• Common ancestors & homology information of a set of proteins.
• Homology can be used to infer properties of a protein, such as function and localization.
• Therefore, domains can be used to assign a new protein to a family and so infer its function.
Classification of homologs
• Homology is not a sufficiently well-defined term to describe the evolutionary relationships between genes.
• Homologous genes can arise in two major ways:
– Gene duplication (in the same species).
– Speciation (splitting of one species into two).
Classification of homologs
• Orthologs – Two genes from two different species that derive from a single gene in the last common ancestor of the species.
• Paralogs – Two genes that derive from a single gene that was duplicated within a genome.
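The two definitions above can be read as a rule on a gene tree: look at the event at the last common ancestor of the two genes. The sketch below is only an illustration under that reading. The `Node` class, the event labels, and the toy human/worm gene names are hypothetical, and the two genes are assumed to be distinct leaves of the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    event: str                       # "speciation", "duplication" or "leaf"
    name: str = ""                   # gene name, used for leaves only
    children: list = field(default_factory=list)


def leaves(node):
    """Return the set of gene names found below this node."""
    if node.event == "leaf":
        return {node.name}
    return set().union(*(leaves(child) for child in node.children))


def last_common_ancestor(node, a, b):
    """Descend while a single child still contains both genes."""
    for child in node.children:
        if {a, b} <= leaves(child):
            return last_common_ancestor(child, a, b)
    return node


def relationship(root, a, b):
    """Orthologs if the genes split at a speciation, paralogs if at a duplication."""
    event = last_common_ancestor(root, a, b).event
    return "orthologs" if event == "speciation" else "paralogs"


# Hypothetical tree: one duplication in the ancestor, then a human/worm speciation.
tree = Node("duplication", children=[
    Node("speciation", children=[Node("leaf", "humanA"), Node("leaf", "wormA")]),
    Node("speciation", children=[Node("leaf", "humanB"), Node("leaf", "wormB")]),
])

for pair in [("humanA", "wormA"), ("humanA", "humanB"), ("humanA", "wormB")]:
    print(pair, relationship(tree, *pair))
```

In this toy tree the duplication predates the speciation, so humanA and wormB come out as paralogs even though they are in different species.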
Classification of homologs
[Figure: diagram with gene pairs labelled para (paralogs) and ortho (orthologs).]
Classification of homologs
• Inparalogs - paralogs that evolved by gene duplication after the speciation event.
• Outparalogs - paralogs that evolved by gene duplication before the speciation event.
Classification of homologs
[Figure: out-paralogs (out-para) and in-paralogs (in-para) when comparing human with worm.]
What can we learn from them?
• Orthologous proteins are evolutionary, and typically functional, counterparts in different species.
• Paralog proteins are important for detecting lineage-specific adaptations.
• Both of them can reveal information on a specific species or a set of species.
Protein domains – summary
• By identifying domains we can:
– Infer the function and localization of a protein.
– Learn about a specific species.
– Learn about a set of species as a group.
Domain representation
• Different methods to represent (model) domains:
• Patterns (regular expressions).
• PSSM (Position specific score matrix).
• HMM (Hidden Markov model).
PSSM
• Position specific score matrix
• A score matrix giving the score for each amino acid at each position of the domain.
• Based on the independent probabilities P(a|i) of observing amino acid a in position i.
PSSM: Example
PSSM: Identifying a domain
• Given a sequence and a PSSM:
• Run over all positions.
• Score each sub-sequence according to the matrix.
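A minimal sketch of this sliding-window scan is given below. The 3-position matrix, its toy 4-letter alphabet, and the threshold are hypothetical values chosen for illustration; a real protein PSSM would have one entry per amino acid in each column.

```python
# Hypothetical 3-position PSSM over a toy 4-letter alphabet (illustrative scores only).
PSSM = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -2.0},
    {"A": -1.0, "C": 3.0, "G": -2.0, "T": -1.0},
    {"A": -2.0, "C": -1.0, "G": 2.5, "T": 0.0},
]


def window_score(window, pssm=PSSM):
    """Sum the per-position scores of one sub-sequence."""
    return sum(column[residue] for column, residue in zip(pssm, window))


def scan(sequence, pssm=PSSM):
    """Score every sub-sequence of the matrix width; return (start, score) pairs."""
    width = len(pssm)
    return [
        (start, window_score(sequence[start:start + width], pssm))
        for start in range(len(sequence) - width + 1)
    ]


def hits_above(sequence, threshold, pssm=PSSM):
    """Keep only the windows whose score reaches the (assumed) threshold."""
    return [(start, s) for start, s in scan(sequence, pssm) if s >= threshold]


print(scan("TTACGTT"))
print(hits_above("TTACGTT", threshold=5.0))
```

How the threshold is chosen is outside the scope of this sketch; the value here is arbitrary.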
HMM: Hidden Markov Model
• Markov model: a way of describing a process that goes through a series of states.
• Each state has a probability of transitioning to the other states.
• x_i is a random variable giving the state at step i.
[Diagram: chain of states x1 → x2 → x3 → x4]
HMM: Markov Model
• Example:
• States are {0,1}
x1 =0 x2 =0 x3 =0 x4 =0
x1 =1 x2 =0 x3 =0 x4 =1
x1=0 x2=1 x3 =1 x4 =1
HMM: Markov Model
• Transition matrix:
$$A = (a_{ij}) = \begin{pmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{pmatrix}, \qquad a_{ij} = P(x_{k+1} = j \mid x_k = i)$$
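As a worked example, the probability of a whole state sequence under a Markov model is the probability of its first state times the transition probabilities along the path. The sketch below applies this to the three example sequences from the earlier slide, using the transition values as read from this slide. The uniform initial distribution is an assumption, since the slides do not give one.

```python
# Transition probabilities as read from the slide: A[i][j] = P(x_{k+1} = j | x_k = i).
A = [[0.6, 0.4],
     [0.2, 0.8]]
INITIAL = [0.5, 0.5]  # assumption: the slides give no initial distribution


def chain_probability(states, transitions=A, initial=INITIAL):
    """P(x1) times the product of the transition probabilities along the path."""
    p = initial[states[0]]
    for previous, current in zip(states, states[1:]):
        p *= transitions[previous][current]
    return p


for path in ([0, 0, 0, 0], [1, 0, 0, 1], [0, 1, 1, 1]):
    print(path, chain_probability(path))
```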
HMM: Markov Model
• State transition example:
• States are the nucleotides A, T, G, C.
HMM: Hidden Markov Model
• Hidden Markov model:
• Each state x emits an output y, at a specific probability.
• We only know the output (observations).
• Thus, the states are hidden.
[Diagram: hidden states x1 to x4, each emitting an observed output y1 to y4]
HMM: Hidden Markov Model
• Example: states are {0,1}, output {0,1}
y1 =1 y2 =1 y3 =0 y4 =0
x1 =0 x2 =1 x3 =1 x4 =1
y1 =1 y2 =0 y3 =1 y4 =0
x1 =1 x2 =0 x3 =0 x4 =1
HMM: Hidden Markov Model
• Emission matrix:
$$B = (b_{ij}) = \begin{pmatrix} 0.1 & 0.9 \\ 0.85 & 0.15 \end{pmatrix}, \qquad b_{ij} = P(y_k = j \mid x_k = i)$$
HMM: What can we do with it?
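• Given (A, B):
• Probability of given states and outputs:
$$P(x_1 x_2 \dots x_n,\, y_1 y_2 \dots y_n) = P(x_1)\,P(y_1 \mid x_1)\,P(x_2 \mid x_1)\,P(y_2 \mid x_2) \cdots P(x_n \mid x_{n-1})\,P(y_n \mid x_n)$$
• Probability of a given output sequence:
$$P(y_1 y_2 \dots y_n) = \sum_{x_1 \dots x_n} P(x_1 x_2 \dots x_n,\, y_1 y_2 \dots y_n)$$
• Most likely sequence of states that generated a given output sequence:
$$\max_{x_1 \dots x_n} P(x_1 x_2 \dots x_n \mid y_1 y_2 \dots y_n)$$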
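The three quantities on this slide can be computed as sketched below: the joint probability directly, the output probability with the forward recursion, and the most likely state path with the Viterbi recursion. These two standard dynamic-programming algorithms are named here as one common way to do it; the slides do not say which method is intended. The matrix values are as read from the earlier slides and the uniform initial distribution `PI` is an assumption.

```python
from itertools import product

A = [[0.6, 0.4], [0.2, 0.8]]      # A[i][j] = P(x_{k+1} = j | x_k = i), as read from the slides
B = [[0.1, 0.9], [0.85, 0.15]]    # B[i][j] = P(y_k = j | x_k = i), as read from the slides
PI = [0.5, 0.5]                   # assumption: initial state distribution not given
N = len(PI)


def joint(states, outputs):
    """P(x1..xn, y1..yn) for one state path and one output sequence."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for k in range(1, len(states)):
        p *= A[states[k - 1]][states[k]] * B[states[k]][outputs[k]]
    return p


def forward(outputs):
    """P(y1..yn): sum over all state paths, computed one step at a time."""
    alpha = [PI[i] * B[i][outputs[0]] for i in range(N)]
    for y in outputs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][y] for j in range(N)]
    return sum(alpha)


def viterbi(outputs):
    """Most likely state path and its joint probability with the outputs."""
    score = [PI[i] * B[i][outputs[0]] for i in range(N)]
    back = []
    for y in outputs[1:]:
        pointers, new_score = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: score[i] * A[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] * A[best_i][j] * B[j][y])
        back.append(pointers)
        score = new_score
    path = [max(range(N), key=lambda i: score[i])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], max(score)


def brute_force(outputs):
    """Joint probability of every possible state path (only feasible for tiny n)."""
    return {s: joint(s, outputs) for s in product(range(N), repeat=len(outputs))}


outputs = [1, 1, 0, 0]
print(forward(outputs))
print(viterbi(outputs))
```

The path that maximizes the joint probability also maximizes P(x | y), because P(y) is the same for every path.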
HMM: What can we do with it?
• Learning:
• Given state and output sequences, calculate the most probable (A, B).
• Easy when the states are known.
• Otherwise: use a training algorithm (for example Baum-Welch).
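For the easy case, where the states are known, a common estimate is to count transitions and emissions and normalize each row. The sketch below does that for the two example state/output pairs from the earlier slide. The function name and the choice to leave unseen rows at zero are assumptions for illustration, and no pseudocounts are applied.

```python
from collections import Counter


def estimate(state_seqs, output_seqs, n_states=2, n_outputs=2):
    """Count-based estimates of (A, B) from fully labelled sequences."""
    transitions, emissions = Counter(), Counter()
    for xs, ys in zip(state_seqs, output_seqs):
        transitions.update(zip(xs, xs[1:]))   # (state, next state) pairs
        emissions.update(zip(xs, ys))         # (state, output) pairs

    def normalise(counts, n_cols):
        rows = []
        for i in range(n_states):
            total = sum(counts[(i, j)] for j in range(n_cols))
            rows.append([counts[(i, j)] / total if total else 0.0 for j in range(n_cols)])
        return rows

    return normalise(transitions, n_states), normalise(emissions, n_outputs)


# The two labelled examples from the earlier slide.
states = [[0, 1, 1, 1], [1, 0, 0, 1]]
outputs = [[1, 1, 0, 0], [1, 0, 1, 0]]
A_hat, B_hat = estimate(states, outputs)
print(A_hat)
print(B_hat)
```

With only eight observations the estimates are very rough; the point is the counting, not the numbers.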
HMM: Profile HMM
• Use HMM to represent sequence families.
• A particular type of HMM suited to modeling multiple alignments.
• (Assume we have a multiple alignment).
HMM: Trivial profile HMM
• We begin with ungapped regions.
• Each position corresponds to a state.
• Transitions are of probability 1.
HMM: Trivial profile HMM
• Let e_i(a) be the independent probability of observing amino acid a in position i.
• The probability of a new sequence x, according to the model:
$$P(x \mid M) = \prod_{i=1}^{N} e_i(x_i)$$
HMM: Trivial profile HMM
• We can score the sequence x:
$$S = \sum_{i=1}^{N} \log \frac{e_i(x_i)}{q_{x_i}}$$
• Where q indicates the probability under a random model.
HMM: Trivial profile HMM
• Consider the values
$$\log \frac{e_i(x_i)}{q_{x_i}}$$
• They behave like elements in a score matrix.
• The trivial profile HMM is equivalent to a PSSM.
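The equivalence can be checked numerically: summing the per-position log-odds values gives the same number as the log of the ratio between the model probability and the random-model probability. The sketch below assumes a toy ungapped alignment over a 4-letter alphabet, a uniform random model, and a pseudocount of 1; all three are illustrative choices, not taken from the slides.

```python
import math

ALIGNMENT = ["ACG", "ACG", "ACT", "TCG"]   # hypothetical ungapped alignment
ALPHABET = "ACGT"                          # toy alphabet; proteins would use 20 letters
Q = {a: 0.25 for a in ALPHABET}            # assumed uniform random model


def emissions(alignment, pseudocount=1.0):
    """e_i(a): per-column residue frequencies, smoothed with a pseudocount."""
    columns = []
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment]
        total = len(column) + pseudocount * len(ALPHABET)
        columns.append({a: (column.count(a) + pseudocount) / total for a in ALPHABET})
    return columns


E = emissions(ALIGNMENT)

# The log-odds values, arranged exactly like a position specific score matrix.
SCORE_MATRIX = [{a: math.log(e[a] / Q[a]) for a in ALPHABET} for e in E]


def prob_under_model(x):
    """P(x | M) as the product of the per-position emission probabilities."""
    return math.prod(E[i][a] for i, a in enumerate(x))


def prob_random(x):
    return math.prod(Q[a] for a in x)


def score(x):
    """S as a sum of score-matrix entries, one per position."""
    return sum(SCORE_MATRIX[i][a] for i, a in enumerate(x))


x = "ACG"
print(score(x), math.log(prob_under_model(x) / prob_random(x)))
```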
HMM: profile HMM
• Let’s untrivialize by allowing for gaps: insertions and deletions.
• Start off with the PSSM HMM.
HMM: profile HMM
• Handling insertions:
• Introduce new states I_j, which match insertions after position j.
• These states emit according to the random (background) distribution.
HMM: profile HMM
• The score of a gap of length k:
$$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1)\log a_{I_j I_j}$$
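The gap score is one term for entering the insert state, one for leaving it, and k-1 self-loop terms, so each extra inserted residue costs the same fixed amount. A minimal sketch follows; the three transition probabilities are made-up values, not taken from any real profile.

```python
import math


def insertion_gap_score(k, a_match_insert, a_insert_insert, a_insert_match):
    """log a(M_j -> I_j) + log a(I_j -> M_{j+1}) + (k - 1) * log a(I_j -> I_j)."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return (
        math.log(a_match_insert)
        + math.log(a_insert_match)
        + (k - 1) * math.log(a_insert_insert)
    )


# Hypothetical transition probabilities for one insert state.
params = dict(a_match_insert=0.1, a_insert_insert=0.5, a_insert_match=0.5)

for k in (1, 2, 3):
    print(k, insertion_gap_score(k, **params))
```

The first two terms act as a gap-open cost and the self-loop term as a gap-extension cost.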
HMM: profile HMM
• Handling deletions:
• Introduce silent states Dj.
• These states do not emit.
HMM: profile HMM
• The complete profile HMM:
Internet resources
• Databases of protein families.
• Family information and identification.
• Considerations:
– Type of representation (pattern, PSSM, HMM).
– Choice of seed multiple alignment proteins.
– Quality control.
– Database features (links, annotations, views).
– Database specificity (organism, functions).
Pfam: Home
Pfam
• Protein families database of alignments and HMMs
• Uses profile-HMMs to represent families.
• For each family in Pfam you can:
– Look at multiple alignments
– View protein domain architectures
– Examine species distribution
– Follow links to other databases
– View known protein structures
Pfam: Databases
2 databases:
• Pfam-A – curated multiple alignments.
– Grows slowly.
– Quality controlled by experts.
• Pfam-B – automatic clustering (ProDom derived).
– Complements Pfam-A.
– New sequences instantly incorporated.
– Unchecked: false positives, etc.
Pfam: Features
• Search by: Sequence, keyword, domain, taxonomy.
• Browsing by family or genome.
• Evolutionary tree
Pfam: Construction
• Source of seed alignments:
– Pfam-B families.
– Published articles.
– 'Domain hunting' studies.
– Occasionally, entries from other databases (e.g. MEROPS for peptidases).
Pfam: Domain information
Pfam: Domain organization
Pfam: Multiple alignment
Pfam: HMM logo
Pfam: Species distribution
Pfam: Genome comparison
PROSITE
• Database of protein families.
• Matching according to simple patterns or PSSM profiles.
• Browsing all proteins of a specific family.
• The latest release contains 1696 protein families.
PROSITE: Features
• Comprehensive domain documentation.
• All profile matches checked by experts.
• Specificity/sensitivity:
– Specificity: true-pos / all-pos (the fraction of reported matches that are correct, elsewhere often called precision).
– Sensitivity: true-pos / (true-pos + false-neg).
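As a small worked example of these two ratios, the sketch below computes them from match counts. The counts are invented for illustration and do not describe any real PROSITE entry.

```python
def specificity(true_pos, false_pos):
    """True positives over all reported positives, as defined on the slide."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """True positives over all real family members (found or missed)."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts for one pattern scanned against a database.
true_pos, false_pos, false_neg = 90, 10, 30

print(specificity(true_pos, false_pos))
print(sensitivity(true_pos, false_neg))
```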
PROSITE: Example
• Specificity of Zinc finger C2H2 type domain
SMART
• Simple Modular Architecture Research Tool
• Identification and annotation of genetically mobile domains and the analysis of domain architectures.
• SMART consists of a library of HMMs.
• Contains 665 HMMs to date.
SMART: Features
• Finding proteins that contain specific domains, i.e. members of the same family
• Function prediction
• Sub-cellular localization
• Binding partners
• Architecture
• Alternative splicing information
• Orthology information
SMART: Domain selection example
Tyrosine kinase (TyrKc) AND Transmembrane region (TRANS)
InterPro
• InterPro integrates 9 other databases, including SMART, Pfam and ProDom.
• Queries can use many different methods (as the other databases use different methods).
• However, thresholds are predefined and cannot be changed for those methods.
InterPro
• Provides more results, but can sometimes be redundant.
• Coverage statistics:
– 93% of Swiss-Prot v42.5 (128540 out of 138922 proteins)
– 81% of TrEMBL v25.5 (819966 out of 1013263 proteins)
InterPro: Features
• Searching by Protein/DNA sequences
• Finding domains & homologs
• List of InterPro entries of type:
– Family
– Domain
– Repeat
– PTM (post-translational modification)
– Binding Site
– Active Site
– Keyword
InterPro: Example
• Kringle domain
Research Example: Introduction
• Goal: The systematic identification of novel protein domain families.
• Using computational methods.
Research Example: Method
• Derive a set of 107 nuclear domains
• Extract proteins
• Extract unannotated regions
• Cluster sequences
• Take longest member
• PSI-BLAST
• Investigate homologous regions
• Manual confirmation
Research Example: Results
• 28 New Domains identified:
• 15 domains in diverse contexts, in different species.
• 3 domains species specific.
• 7 domains with weak similarity to previously described domains.
• 3 extension domains.
Predictions of Function
• On the basis of reports in literature and/or occurrence with other identified domains, functional features can be predicted for our novel domain families.
• Examples:
– Chromatin binding
– Protein interaction
– Predicted sub-cellular localization
Predictions of Function: Chromatin-Binding example
• The novel domain CSZ is contained in protein SPT6, which regulates transcription via chromatin structure modification.
• SPT6 has a histone-binding capability, experimentally confirmed.
• Other domains (S1, SH2) in SPT6 are unlikely to bind histones or chromatin.
• Conclusion: CSZ has a predicted histone binding function.
Predictions of Function: Localization example
• Some of the novel domains are only found within proteins from the initial set of nuclear domains.
• This predicts that these domains have a nuclear function.
• The other domains are likely to have roles in both nucleus and cytoplasm.
Conclusion
• Domains are the functional units of proteins.
• Identifying a domain within a new protein may teach us much about it.
• There are several types of models to represent domains.
• These models can also be used to identify the domain they represent.
• Many Internet databases are available to catalogue and identify families.
• New domains can be identified with a protocol that builds on known ones.
Resources
• Pfam: http://www.sanger.ac.uk/Software/Pfam/
• SMART: http://smart.embl-heidelberg.de/
• PROSITE: http://www.expasy.org/prosite/
• InterPro: http://www.ebi.ac.uk/interpro/
The End