mapping protein to function
![Page 1: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/1.jpg)
Mapping Proteins to Functions
Part 1
dsdht.wikispaces.com
![Page 2: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/2.jpg)
Points to remember
Proteins are single, unbranched chains of amino acid monomers.
There are 20 standard amino acids.
There are four levels of protein structure: primary, secondary, tertiary and quaternary.
A protein's amino acid sequence determines its three-dimensional structure (conformation).
![Page 3: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/3.jpg)
Proteins Functional Classes
Why do we care about protein function?
• Diagnose the causes of disease.
• Discover new drugs.
• Understand the mechanism of action of processes in the system.
![Page 4: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/4.jpg)
Data used for prediction of protein function
• Amino acid sequences
• Protein structure
• Genome sequences
• Phylogenetic data
• Microarray expression data
• Protein interaction networks and protein complexes
• Biomedical literature
![Page 5: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/5.jpg)
The concept of protein function is highly context-sensitive and not very well defined. In fact, it typically acts as an umbrella term for all the types of activity that a protein is involved in, be they cellular, molecular or physiological.
![Page 6: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/6.jpg)
Characterization of protein function
Predicting function: from genes to genomes. Bork et al. 1998.
Molecular function, cellular function and phenotypic function are hierarchically related.
![Page 7: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/7.jpg)
Gene Ontology classification scheme categorizes protein function into cellular component, molecular function and biological process.
http://www.nature.com/ng/journal/v25/n1/full/ng0500_25.html http://www.geneontology.org/
In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between pairs of concepts. It can be used to model a domain and support reasoning about entities. Read more at http://www.answers.com/topic/ontology-computer-science
![Page 8: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/8.jpg)
GO Format
Figure adapted from [Ashburner et al. 2000]
• Wide coverage
• Standardized format
• Hierarchical structure
• Disjoint categories
• Multiple functions
• Dynamic nature
![Page 9: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/9.jpg)
Molecular function
• Molecular function describes activities, such as catalytic or binding activities, at the molecular level
• GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place
• Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity
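The broad-to-narrow relationship between terms can be sketched as a small directed graph in which each term points to its broader parents. The sketch below is illustrative only: the term names come from the examples above, but the edges are assumptions made for this example and are not copied from the real GO graph.

```python
# Toy "is_a" hierarchy. Edges are illustrative assumptions, not real GO data.
# A term may list several parents, so the structure is a graph rather than a tree.
IS_A = {
    "adenylate cyclase activity": ["cyclase activity"],
    "cyclase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "transporter activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, is_a=IS_A):
    """Return every broader term reachable from `term` via is_a edges."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def propagate(annotations, is_a=IS_A):
    """Expand direct annotations so that a narrow term also implies its broader terms."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, is_a)
    return expanded


if __name__ == "__main__":
    print(sorted(propagate({"adenylate cyclase activity"})))
```

Storing only the most specific annotation and expanding it on demand keeps the annotation set small while still allowing queries at any level of the hierarchy.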
![Page 10: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/10.jpg)
Biological process
• A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions
• An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport.
• It can be difficult to distinguish between a biological process and a molecular function.
![Page 11: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/11.jpg)
Cellular component
• A cellular component is just that, a component of a cell that is part of some larger object
• It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer)
• The cellular component categories are probably the best defined categories since they correspond to actual entities
![Page 12: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/12.jpg)
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.
![Page 13: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/13.jpg)
DAVID (Gene Ontology Enrichment)
YouTube videos:
http://www.youtube.com/watch?v=xIu9mm6b7N0
http://www.youtube.com/watch?v=zedjRViji2c
Try out the microarray probe list given below for analyzing proteins.
31741_at 31734_at 32696_at 37559_at 41400_at 35985_at 39304_g_at 41438_at 35067_at 32919_at 35429_at 36674_at 967_g_at 36669_at 39242_at 39573_at 39407_at 33346_r_at 40319_at 2043_s_at 1788_s_at 36651_at 41788_i_at 35595_at 36285_at 39586_at 35160_at 39424_at 36865_at 2004_at 36728_at 37218_at 40347_at 36226_r_at 33012_at 37906_at 32872_at
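The idea behind GO enrichment can be sketched with a plain hypergeometric tail probability: given a background of genes, how surprising is it that this many genes in the submitted list carry a given GO term? This is a minimal sketch, not DAVID's implementation; DAVID reports its own variant of the statistic, so its numbers will differ. All counts in the usage example are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, list_size, hits):
    """P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background
    annotated:  background genes carrying the GO term
    list_size:  genes in the submitted list
    hits:       list genes carrying the GO term
    """
    max_hits = min(annotated, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 37 probes submitted, 6 of them annotated with a term
    # that covers 200 of 12000 background genes.
    print(enrichment_p_value(12000, 200, 37, 6))
```

A small p-value suggests the term is over-represented in the list; when many terms are tested at once, a multiple-testing correction would normally be applied on top of this.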
![Page 14: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/14.jpg)
Sequence & Structure based methods
Part 2
![Page 15: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/15.jpg)
Basic Set of Protein Annotations
• Protein name - descriptive common name for the protein, e.g. “kinase”
• Gene symbol - mnemonic abbreviation for the gene, e.g. “recA”
• EC number - Enzyme Commission classification of the reaction an enzyme catalyzes
• Role - what the protein is doing in the cell and why, e.g. “involved in glycolysis”
• Supporting evidence - accession numbers of BER and HMM matches, or whatever information you used to make the annotation
• Unique identifier - e.g. locus IDs
![Page 16: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/16.jpg)
Sequence Similarity Evidence
• Pairwise alignments - two proteins' amino acid sequences aligned next to each other so that the maximum number of amino acids match
• Multiple alignment - 3 or more amino acid sequences aligned to each other so that the maximum number of amino acids match in each column
• Protein families - clusters of proteins that all share sequence similarity and presumably similar function
• Motifs - short regions of amino acid sequence shared by many proteins. A motif can be found in a number of different proteins, where it carries out similar functions.
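Pairwise alignment is usually computed by dynamic programming. The sketch below scores the best global alignment of two sequences in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity; real protein aligners use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of sequences a and b."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("MKVLA", "MKLA"))
```

Only two rows of the score table are kept, which is enough for the score; recovering the alignment itself would require the full table and a traceback.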
![Page 17: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/17.jpg)
Important terms to understand
• Homologs – two sequences that have evolved from a common ancestor; they may or may not share the same function.
• Orthologs – a type of homolog where the two sequences are in different species and arose from a common ancestor. Speciation has created the two copies of the sequence.
• Paralogs – a type of homolog where the two sequences have arisen due to a gene duplication within one species. They initially have the same function, but as time goes by one copy is free to evolve new functions while the other copy maintains the original function.
• Xenologs – a type of homolog where the two gene sequences have arisen due to horizontal transfer (transfer between organisms by means other than reproduction).
![Page 18: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/18.jpg)
Taken from http://ae.igs.umaryland.edu/docs/FunctionalAnnotApril.pdf
![Page 19: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/19.jpg)
Sequence similarity, sequence homology, and functional homology
• Sequence similarity means that the sequences are similar – no more, no less
• Sequence homology implies that the proteins are encoded by genes that share a common ancestry.
• Functional homology means that two proteins from two organisms have the same function.
• Sequence similarity or sequence homology does not guarantee functional homology
![Page 20: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/20.jpg)
Existing Sequence based function prediction methods
Homology based approaches
• BLAST
• FASTA
• SSEARCH
• PSI-BLAST - iterates searches by using a sequence profile computed from a multiple sequence alignment obtained from the previous round of searching
Subsequence based approaches
• Motifs and domains http://molbiol-tools.ca/Motifs.htm
Feature based approaches
• Properties such as normalized van der Waals volume, polarity, charge and surface tension are averaged over all the residues in the sequence to obtain the feature-value vector for the protein, which is used to train a classifier
• SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi)
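The averaging step of the feature based approach can be sketched as follows. This is a minimal illustration, not SVMProt's actual descriptor set: it uses residue charge at neutral pH and, as a stand-in property that is not named on the slide, the Kyte-Doolittle hydropathy scale. The function name `feature_vector` is hypothetical.

```python
# Net side-chain charge at neutral pH; residues not listed are treated as 0.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}

# Kyte-Doolittle hydropathy index (stand-in property, an assumption of this sketch).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def feature_vector(sequence, scales=(CHARGE, HYDROPATHY)):
    """Average each per-residue property over the whole sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return [
        sum(scale.get(residue, 0.0) for residue in sequence) / len(sequence)
        for scale in scales
    ]


if __name__ == "__main__":
    print(feature_vector("MKVLAAGIDE"))
```

Every protein maps to a vector of the same length regardless of sequence length, which is what allows a standard classifier to be trained on the result.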
![Page 21: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/21.jpg)
Drawbacks of BLAST and FASTA
• Typically provide functional annotation for only about half of the genes in a genome, since homologous sequences for the rest are not found at accepted significance thresholds.
• Automated methods of annotation transfer between similar sequences contribute to error propagation.
![Page 22: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/22.jpg)
Enhanced Sequence based methods
• PFP – Kihara lab (http://kiharalab.org/web/pfp.php)
• The PFP algorithm uses PSI-BLAST (version 2.2.6) to predict probable GO function annotations in three categories (molecular function, biological process, and cellular component) with statistical significance scores (P-values)
• For each sequence retrieved by PSI-BLAST, the associated GO terms are scored.
• GO terms are scored according to a) the frequency of their association with similar sequences and b) the degree of similarity those sequences share with the query
In the scoring formula, s(fa) is the final score assigned to the GO term fa, N is the number of similar sequences retrieved by PSI-BLAST, Nfunc(i) is the number of GO terms annotating sequence i, E_value(i) is the E-value given to sequence i, fj is a GO term annotating sequence i, and b is the constant value 2 (= log10 100), which keeps the score positive. P(fa|fj) is the association score for fa given fj, obtained from the function association matrix (FAM).
In the definition of P(fa|fj), c(fa, fj) is the number of times fa and fj are assigned simultaneously to a sequence in UniProt, c(fj) is the total number of times fj appears in UniProt, l is the size of one dimension of the FAM (i.e. the total number of unique GO terms), and ε is the pseudocount.
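The equations themselves did not survive in this transcript. The sketch below assumes the form suggested by the symbol definitions above: each hit contributes a weight of (-log10 E_value(i) + b) for every GO term fj it carries, multiplied by an association score P(fa|fj) taken as a pseudocount-smoothed co-occurrence frequency. Function names, the E-value floor and the default ε are hypothetical choices made for this example, not values from the PFP paper.

```python
import math
from collections import defaultdict


def association(fa, fj, pair_counts, term_counts, n_terms, eps=0.05):
    """Assumed form of P(fa|fj): (c(fa, fj) + eps) / (c(fj) + n_terms * eps)."""
    if fa == fj:
        co_count = term_counts.get(fj, 0)  # a term always co-occurs with itself
    else:
        # Co-occurrence counts are symmetric, so accept either key order.
        co_count = pair_counts.get((fa, fj), pair_counts.get((fj, fa), 0))
    return (co_count + eps) / (term_counts.get(fj, 0) + n_terms * eps)


def go_term_scores(hits, candidate_terms, pair_counts, term_counts,
                   n_terms, b=2.0, eps=0.05):
    """Score candidate GO terms from search hits.

    hits: list of (e_value, [GO terms annotating that hit]).
    """
    scores = defaultdict(float)
    for e_value, terms in hits:
        # Floor the E-value (an assumption) so that a reported 0.0 does not break log10.
        weight = -math.log10(max(e_value, 1e-300)) + b
        for fj in terms:
            for fa in candidate_terms:
                scores[fa] += weight * association(
                    fa, fj, pair_counts, term_counts, n_terms, eps
                )
    return dict(scores)


if __name__ == "__main__":
    # Hypothetical hits and counts, for illustration only.
    hits = [(1e-30, ["term_x", "term_y"]), (5.0, ["term_y"])]
    counts = {"term_x": 40, "term_y": 100}
    pairs = {("term_x", "term_y"): 25}
    print(go_term_scores(hits, ["term_x", "term_y"], pairs, counts, n_terms=2))
```

Because b = 2 offsets the weight, even a weak hit with an E-value close to 100 still contributes a non-negative amount, which is how distant hits can add evidence for a term.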
![Page 23: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/23.jpg)
When homology searches fail
• Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function.
• No functional information can thus be transferred based on simple sequence homology
• By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function
![Page 24: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/24.jpg)
Protein domains
• Many eukaryotic proteins consist of multiple globular domains that can fold independently
• These domains have been mixed and matched through evolution
• Each type of domain contributes towards the molecular function of the complete protein
• Numerous resources are able to identify such domains from sequence alone using HMMs
![Page 25: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/25.jpg)
![Page 26: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/26.jpg)
![Page 27: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/27.jpg)
![Page 28: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/28.jpg)
![Page 29: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/29.jpg)
Which domain resource should I use?
• SMART is focused on signal transduction domains
• Pfam is very actively developed and thus tends to have the most up-to-date domain collection
• InterPro is useful for genome annotation since the domains are annotated with GO terms
• CDD is conveniently integrated with the NCBI BLAST web interface
![Page 30: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/30.jpg)
Function prediction from post translational modifications
• Proteins with similar function may not be related in sequence
• Still they must perform their function in the context of the same cellular machinery
• Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function
![Page 31: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/31.jpg)
The concept of ProtFun
http://www.cbs.dtu.dk/services/ProtFun/
![Page 32: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/32.jpg)
![Page 33: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/33.jpg)
Function prediction on the human prion sequence
############## ProtFun 1.1 predictions ##############
>PRIO_HUMAN
# Functional category                Prob   Odds
Amino_acid_biosynthesis              0.020  0.909
Biosynthesis_of_cofactors            0.032  0.444
Cell_envelope                        0.146  2.393
Cellular_processes                   0.053  0.726
Central_intermediary_metabolism      0.130  2.063
Energy_metabolism                    0.029  0.322
Fatty_acid_metabolism                0.017  1.308
Purines_and_pyrimidines              0.528  2.173
Regulatory_functions                 0.013  0.081
Replication_and_transcription        0.020  0.075
Translation                          0.035  0.795
Transport_and_binding             => 0.831  2.027
# Enzyme/nonenzyme                   Prob   Odds
Enzyme                               0.250  0.873
Nonenzyme                         => 0.750  1.051
# Enzyme class                       Prob   Odds
Oxidoreductase (EC 1.-.-.-)          0.070  0.336
Transferase (EC 2.-.-.-)             0.031  0.090
Hydrolase (EC 3.-.-.-)               0.057  0.180
Lyase (EC 4.-.-.-)                   0.020  0.426
Isomerase (EC 5.-.-.-)               0.010  0.313
Ligase (EC 6.-.-.-)                  0.017  0.334
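Output of this kind is easy to post-process. The sketch below assumes the raw server output has one category per line in the form `name [=>] prob odds`; that layout and the function name `parse_protfun` are assumptions of this example, not a documented format.

```python
import re

# name, optional "=>" flag, probability, odds
ROW = re.compile(r"^\s*(.+?)\s+(=>\s+)?(\d+\.\d+)\s+(\d+\.\d+)\s*$")


def parse_protfun(text):
    """Return (name, prob, odds, flagged) tuples for every data row in the text."""
    rows = []
    for line in text.splitlines():
        found = ROW.match(line)
        if found:  # header and comment lines do not match and are skipped
            name, flag, prob, odds = found.groups()
            rows.append((name, float(prob), float(odds), flag is not None))
    return rows


if __name__ == "__main__":
    sample = """# Enzyme/nonenzyme Prob Odds
Enzyme 0.250 0.873
Nonenzyme => 0.750 1.051
"""
    for row in parse_protfun(sample):
        print(row)
```

Filtering the rows on the last tuple field gives the categories that the server flagged with `=>`.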
![Page 34: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/34.jpg)
ProtFun data sets
• Labeling of training and test data
– Cellular role categories: human SwissProt sequences were categorized using EUCLID
– Enzyme categories: top-level enzyme classifications were extracted from human SwissProt description lines
– Gene Ontology terms were transferred from InterPro
• The sequences were divided into training and test sets without significant sequence similarity
• Binary predictors were trained for each category
![Page 35: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/35.jpg)
Structure based methods
Three standard databases dominate the structure data landscape:
PDB - structure data from NMR and X-ray crystallography
SCOP - organizes the available structures in a hierarchy (Family, Superfamily and Fold) so as to elicit the evolutionary relationships between them
CATH - organizes structures by Class, Architecture, Topology and Homologous superfamily
![Page 36: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/36.jpg)
Structure based methods
Figure labels: Protein folds; Super-secondary structures; Biological function. Adapted from Martin 1998, Protein folds and functions.
![Page 37: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/37.jpg)
Approaches for deriving functional information from 3D structure
Adapted from Thornton et al. 2000, From structure to function, Nature.
![Page 38: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/38.jpg)
http://www.jove.com/video/3259/a-protocol-for-computer-based-protein-structure-function
![Page 39: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/39.jpg)
![Page 40: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/40.jpg)
ProFunc server methods
Sequence-based methods:
• BLAST search against the UniProt Knowledgebase
• FASTA search against sequences of structures in the Protein Data Bank
• InterProScan
• Superfamily search
• Residue conservation mapped onto structure
• Genome location analysis
Structure-based methods:
• Fold matching using MSDfold and DALI
• Helix-Turn-Helix motif search
• Nest analysis
• Surface clefts analysis
Template methods:
• Enzyme active sites
• Ligand binding sites
• DNA binding sites
• Reverse template search
![Page 41: Mapping protein to function](https://reader036.vdocuments.net/reader036/viewer/2022062515/55d748a1bb61eba0178b466f/html5/thumbnails/41.jpg)
References
• http://www.sciencedirect.com/science/article/pii/S0022283698921441
• http://www.nature.com/ng/journal/v25/n1/full/ng0500_25.html
• http://www.nature.com/nprot/journal/v4/n1/full/nprot.2008.211.html
• http://www.cs.helsinki.fi/bioinformatiikka/mbi/courses/07-08/itb/slides/itb0708_slides_83-116.pdf (BLAST and FASTA)
• http://kiharalab.org/web/paper/HawkinsChitaleLubanKihara_Proteins09.pdf
• http://www.ebi.ac.uk/thornton-srv/databases/profunc/doc/profunc_tutorial.pdf