Introduction to Bioinformatics, Lecture 13: Predicting Protein Function. Centre for Integrative Bioinformatics VU (IBIVU)

Post on 21-Dec-2015


Introduction to Bioinformatics

Lecture 13: Predicting Protein Function

Centre for Integrative Bioinformatics VU (IBIVU)


The deluge of genomic information raises the following question: what do all these genes do?

Many genes are not annotated, and many more are partially or erroneously annotated. Given a genome which is partially annotated at best, how do we fill in the blanks?

For each sequenced genome, the functions of 20%-50% of the encoded proteins remain unknown!

Protein Function Prediction


We are faced with the problem of predicting protein function from sequence, genomic, expression, interaction and structural data. For all these reasons and many more, automated protein function prediction is rapidly gaining interest among bioinformaticians and computational biologists

Protein Function Prediction


Outline

Sequence-based function prediction

Structure-based function prediction
– Sequence-structure comparison
– Structure-structure comparison

Motif-based function prediction

Phylogenetic profile analysis

Protein interaction prediction and databases

Functional inference at systems level


Classes of function prediction methods

Sequence-based approaches
– protein A has function X, and protein B is a homolog (ortholog) of protein A; hence B has function X

Structure-based approaches
– protein A has structure X, and X has certain structural features; hence A’s functional sites are ….

Motif-based approaches
– a group of genes have function X and they all have motif Y; protein A has motif Y; hence protein A’s function might be related to X

Function prediction based on “guilt-by-association”
– gene A has function X and gene B is often “associated” with gene A; hence B might have a function related to X


Sequence-based function prediction

Homology searching

Sequence comparison is a powerful tool for detecting homologous genes, but it is limited to genomes that are not too distantly related

Query: 2   LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL 61
           LSD +   V  +W K+       G + L R+   +P+T   F  +      D    S ++
Sbjct: 3   LSDKDKAAVRALWSKIGKSSDAIGNDALSRMIVVYPQTKIYFSHWP-----DVTPGSPNI 57

Query: 62  KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG 121
           K HG  V+  +   + K    +  +  L++ HA K ++     + ++ CI+ V+ +  P
Sbjct: 58  KAHGKKVMGGIALAVSKIDDLKTGLMELSEQHAYKLRVDPSNFKILNHCILVVISTMFPK 117

Query: 122 DFGADAQGAMNKALELFRKDMASNYK 147
           +F  +A  +++K L      +A  Y+
Sbjct: 118 EFTPEAHVSLDKFLSGVALALAERYR 143

We have done homology searching (FASTA, BLAST, PSI-BLAST) in earlier lectures
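As a small reminder of what such an alignment reports, the sketch below counts identical columns in the last block of the alignment on this slide (query residues 122-147). The function name `count_identities` is an illustrative choice, not part of BLAST, and the "+" similarity marks (which need a substitution matrix) are deliberately left out.

```python
def count_identities(aligned_query: str, aligned_subject: str) -> int:
    """Count alignment columns in which both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    # A gap character aligned to a gap character is not counted as an identity.
    return sum(
        q == s and q != "-"
        for q, s in zip(aligned_query, aligned_subject)
    )


# Last block of the alignment on this slide (this block contains no gaps).
query = "DFGADAQGAMNKALELFRKDMASNYK"
sbjct = "EFTPEAHVSLDKFLSGVALALAERYR"

identities = count_identities(query, sbjct)
percent_identity = 100.0 * identities / len(query)
print(f"{identities}/{len(query)} identical columns ({percent_identity:.1f}%)")
```

A low identity over a short block like this one is a reminder that annotation transfer by homology should rest on the statistical significance of the whole alignment, not on one fragment.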


Structure-based function prediction

Structure-based methods may detect remote homologues that are not detectable by sequence-based methods
– using structural information in addition to sequence information
– protein threading (sequence-structure alignment) is a popular method

Structure-based methods could provide more than just “homology” information


Threading

Query sequence

Template sequence

+

Template structure

Compatibility score


Structure-based function prediction

Threading

Scoring function for measuring to what extent the query sequence fits into the template structure

For scoring we have to map an amino acid (query sequence) onto a local environment (template structure)

We can use structural features for this:

o Secondary structure

o Is environment inside or outside? – Residue accessible surface area (ASA)

o Polarity of environment

The best (highest scoring) “thread” through the structure gives a so-called structural alignment; this looks exactly like a sequence alignment but is based on structure.
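The scoring idea can be sketched in a few lines, under heavy simplifying assumptions that are not from the lecture: each template position is reduced to one of two environments ("B" for buried, "E" for exposed), the compatibility table is a made-up +1/-1 hydrophobicity rule, and the thread is ungapped. Real threading programs use gaps, secondary structure, accessibility and polarity terms together.

```python
# Hypothetical, deliberately crude residue classes for this sketch.
HYDROPHOBIC = set("AVLIMFWC")


def residue_score(residue: str, environment: str) -> int:
    """Toy compatibility: +1 if a hydrophobic residue sits in a buried ('B')
    position or a non-hydrophobic one in an exposed ('E') position, else -1."""
    fits = (residue in HYDROPHOBIC) == (environment == "B")
    return 1 if fits else -1


def thread_score(query: str, environments: str) -> int:
    """Best score over all ungapped placements of the query on the template."""
    if len(query) > len(environments):
        raise ValueError("query is longer than the template")
    best = None
    for offset in range(len(environments) - len(query) + 1):
        window = environments[offset:offset + len(query)]
        score = sum(residue_score(r, e) for r, e in zip(query, window))
        if best is None or score > best:
            best = score
    return best


def recognise_fold(query: str, templates: dict) -> str:
    """Return the name of the template with the highest compatibility score."""
    return max(templates, key=lambda name: thread_score(query, templates[name]))


# Hypothetical templates, each described only by its environment string.
templates = {"fold_1": "BEBEBB", "fold_2": "EEEEEE", "fold_3": "BBBBBB"}
print(recognise_fold("LKVEAL", templates))
```

Scoring one query against every template in a fold library and keeping the best one is the fold-recognition use of threading.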


Fold recognition by threading

Query sequence

Compatibility scores

Fold 1

Fold 2

Fold 3

Fold N


Structure-based function prediction

SCOP (http://scop.berkeley.edu/) is a protein structure classification database where proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities


Structure-based function prediction

SCOP hierarchy – the top level: 11 classes


Structure-based function prediction

All-alpha protein

Coiled-coil protein

All-beta protein

Alpha-beta protein

Membrane protein


Structure-based function prediction

SCOP hierarchy – the second level: 800 folds


Structure-based function prediction

SCOP hierarchy – the third level: 1294 superfamilies


Structure-based function prediction

SCOP hierarchy – the fourth level: 2327 families


Structure-based function prediction

Using a sequence-structure alignment method, one can predict that a protein belongs to a SCOP family, superfamily or fold

– Proteins predicted to be in the same SCOP family are orthologous
– Proteins predicted to be in the same SCOP superfamily are homologous
– Proteins predicted to be in the same SCOP fold are structurally analogous

folds

superfamilies

families


Structure-based function prediction

Prediction of ligand-binding sites
– For ~85% of ligand-binding proteins, the largest cleft is the ligand-binding site
– For an additional ~10% of ligand-binding proteins, the second largest cleft is the ligand-binding site


Structure-based function prediction

Prediction of macromolecular binding sites
– there is a strong correlation between macromolecular binding sites (for protein, DNA and RNA) and disordered protein regions
– disordered regions in a protein sequence can be predicted using computational methods

http://www.pondr.com/


Motif-based function prediction

Prediction of protein functions based on identified sequence motifs

PROSITE contains patterns specific for more than a thousand protein families.

ScanPROSITE allows you to scan a protein sequence for occurrences of patterns and profiles stored in PROSITE


Motif-based function prediction

Search PROSITE using ScanPROSITE

The sequence has ASN_GLYCOSYLATION N-glycosylation site: 242 - 245 NETL

MSEGSDNNGDPQQQGAEGEAVGENKMKSRLRKGALKKKNVFNVKDHCFIARFFKQPTFCSHCKDFICGYQSGYAWMGFGKQGFQCQVCSYVVHKRCHEYVTFICPGKDKG IDSDSPKTQH ……..


Regular expressions

Alignment

ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ

Regular expression

[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q

{PG} = not (P or G)

For short sequence stretches, regular expressions are often more suitable to describe the information than alignments (or profiles)
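The pattern on this slide can be tried out directly. The sketch below translates the slide's notation into a Python regular expression and checks it against the four aligned sequences; `pattern_to_regex` is a hypothetical helper written for this example, and it covers only the elements used here (single residues, `[..]`, `{..}`, `x`, and a trailing repeat count written either as `x4` or as `x(4)`).

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate a dash-separated motif pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])\(?(\d*)\)?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = m.groups()
        if core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {PG} means: anything except P or G
        elif core == "x":
            core = "."                      # x means: any residue
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


SLIDE_PATTERN = "[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q"
regex = pattern_to_regex(SLIDE_PATTERN)

sequences = [
    "ADLGAVFALCDRYFQ",
    "SDVGPRSCFCERFYQ",
    "ADLGRTQNRCDRYYQ",
    "ADIGQPHSLCERYFQ",
]
for seq in sequences:
    print(seq, bool(re.fullmatch(regex, seq)))
```

To scan a longer protein sequence for occurrences rather than test a whole string, `re.finditer(regex, protein)` could be used in place of `re.fullmatch`.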


Regular expressions

Regular expression            No. of exact matches in DB

D-A-V-I-D                     71
D-A-V-I-[DENQ]                252
[DENQ]-A-V-I-[DENQ]           925
[DENQ]-A-[VLI]-I-[DENQ]       2739
[DENQ]-[AG]-[VLI]2-[DENQ]     51506
D-A-V-E                       1088


Phylogenetic profile analysis

Function prediction of genes based on “guilt-by-association” – a non-homologous approach

The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every sequenced genome

Because proteins that participate in a common structural complex or metabolic pathway are likely to co-evolve, the phylogenetic profiles of such proteins are often “similar”


Phylogenetic profile analysis

Phylogenetic profile (against N genomes)
– For each gene X in a target genome (e.g., E. coli), build a phylogenetic profile as follows
– If gene X has a homolog in genome #i, the ith bit of X’s phylogenetic profile is “1”; otherwise it is “0”
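The construction rule translates almost literally into code. In this sketch the input is assumed to be, for each of the N genomes, the set of target-genome genes that have a homolog there (for example from a homology search with some cutoff); the names `build_profile` and `homologs_by_genome` and the toy data are hypothetical.

```python
def build_profile(gene: str, homologs_by_genome: list) -> str:
    """Return the phylogenetic profile of `gene` as a string of '1' and '0'.

    homologs_by_genome[i] is the set of target-genome genes that have a
    homolog in genome #i, so bit i is '1' exactly when `gene` is in that set.
    """
    return "".join(
        "1" if gene in homologs else "0"
        for homologs in homologs_by_genome
    )


# Toy input for four genomes (illustrative only).
homologs_by_genome = [
    {"geneX", "geneY"},
    {"geneY"},
    {"geneX"},
    set(),
]

for gene in ("geneX", "geneY"):
    print(gene, build_profile(gene, homologs_by_genome))
```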


Phylogenetic profile analysis

Example – phylogenetic profiles based on 60 genomes

orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000

Genes with similar phylogenetic profiles have related functions or are functionally linked – D. Eisenberg and colleagues (1999)

By correlating the rows (open reading frames (ORFs) or genes) you find out about the joint presence or absence of genes: this is a signal for a functional connection

(rows: genes; columns: genomes)
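One simple way to "correlate the rows" is to count the positions at which two profiles differ (Hamming distance) and report pairs below a cutoff. The distance measure, the cutoff and the short toy profiles below are assumptions made for this sketch; they are not the profiles from the slide, and other similarity measures (for example mutual information) are equally possible.

```python
from itertools import combinations


def hamming(profile_a: str, profile_b: str) -> int:
    """Number of genomes in which exactly one of the two genes is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def linked_pairs(profiles: dict, max_distance: int) -> list:
    """Return (distance, gene1, gene2) for all pairs within max_distance."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        distance = hamming(p1, p2)
        if distance <= max_distance:
            pairs.append((distance, g1, g2))
    return sorted(pairs)


# Toy profiles over eight genomes (illustrative only).
profiles = {
    "geneA": "11010011",
    "geneB": "11010111",
    "geneC": "00101100",
    "geneD": "11111111",
}
print(linked_pairs(profiles, max_distance=1))
```

In this toy set geneA and geneB differ in a single genome and would be predicted to be functionally linked, while geneC is the exact inverse of geneA. A gene such as geneD that is present everywhere carries little signal, which is why near-universal profiles are usually uninformative.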


Phylogenetic profile analysis

Phylogenetic profiles contain a great amount of functional information

Phylogenetic profile analysis can be used to distinguish orthologous genes from paralogous genes

Subcellular localization: 361 yeast nucleus-encoded mitochondrial proteins are identified at 50% accuracy with 58% coverage through phylogenetic profile analysis

Functional complementarity: By examining inverse phylogenetic profiles, one can find functionally complementary genes that have evolved through one of several mechanisms of convergent evolution.


Prediction of protein-protein interactions

Rosetta stone

Gene fusion is an effective method for the prediction of protein-protein interactions
– If proteins A and B are homologous to two domains of a protein C, A and B are predicted to interact

Though gene fusion has low prediction coverage, its false-positive rate is low

A B

C
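The rule "A and B match two different domains of C" can be sketched as follows. The input format is an assumption for this example: each hit is a tuple (query protein, composite protein, start, end) giving the region of the composite protein that the query is homologous to, as might be extracted from homology search output. Two queries are paired when they hit non-overlapping regions of the same composite protein.

```python
from collections import defaultdict
from itertools import combinations


def rosetta_stone_pairs(hits: list) -> set:
    """Predict interacting pairs from hits to a shared composite protein.

    hits: iterable of (query, composite, start, end) tuples.
    Returns a set of (protein1, protein2, composite) with protein1 < protein2.
    """
    by_composite = defaultdict(list)
    for query, composite, start, end in hits:
        by_composite[composite].append((query, start, end))

    predicted = set()
    for composite, regions in by_composite.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                a, b = sorted((q1, q2))
                predicted.add((a, b, composite))
    return predicted


# Toy hits (illustrative only): A and B cover different parts of C,
# whereas D overlaps both of them.
hits = [
    ("A", "C", 1, 120),
    ("B", "C", 150, 300),
    ("D", "C", 100, 200),
    ("E", "F", 1, 50),
]
print(rosetta_stone_pairs(hits))
```

Requiring non-overlapping regions keeps two paralogs that hit the same domain of C from being reported as an interacting pair.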


Domain fusion example

Vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt) [1]. In insects, the polypeptide appears as GARs-(AIRs)2-GARt. However, GARs-AIRs is encoded separately from GARt in yeast, and in bacteria each domain is encoded separately (Henikoff et al., 1997).

[1] GAR: glycinamide ribonucleotide; AIR: aminoimidazole ribonucleotide


Protein interaction database

There are numerous databases of protein-protein interactions

DIP is a popular protein-protein interaction database

The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions.


Protein interaction databases

BIND - Biomolecular Interaction Network Database
DIP - Database of Interacting Proteins
PIM – Hybrigenics
PathCalling Yeast Interaction Database
MINT - a Molecular Interactions Database
GRID - The General Repository for Interaction Datasets
InterPreTS - protein interaction prediction through tertiary structure
STRING - predicted functional associations among genes/proteins
Mammalian protein-protein interaction database (PPI)
InterDom - database of putative interacting protein domains
FusionDB - database of bacterial and archaeal gene fusion events
IntAct Project
The Human Protein Interaction Database (HPID)
ADVICE - Automated Detection and Validation of Interaction by Co-evolution
InterWeaver - protein interaction reports with online evidence
PathBLAST - alignment of protein interaction networks
ClusPro - a fully automated algorithm for protein-protein docking
HPRD - Human Protein Reference Database


Network of protein interactions and predicted functional links involving silencing information regulator (SIR) proteins. Filled circles represent proteins of known function; open circles represent proteins of unknown function, represented only by their Saccharomyces genome sequence numbers (http://genome-www.stanford.edu/Saccharomyces). Solid lines show experimentally determined interactions, as summarized in the Database of Interacting Proteins [19] (http://dip.doe-mbi.ucla.edu). Dashed lines show functional links predicted by the Rosetta Stone method [12]. Dotted lines show functional links predicted by phylogenetic profiles [16]. Some predicted links are omitted for clarity.


Network of predicted functional linkages involving the yeast prion protein [20] Sup35. The dashed line shows the only experimentally determined interaction. The other functional links were calculated from genome and expression data [11] by a combination of methods, including phylogenetic profiles, Rosetta stone linkages and mRNA expression. Linkages predicted by more than one method, and hence particularly reliable, are shown by heavy lines. Adapted from ref. 11.


STRING - predicted functional associations among genes/proteins

STRING is a database of predicted functional associations among genes/proteins.

Genes of similar function tend to be maintained in close neighborhood, tend to be present or absent together, i.e. to have the same phylogenetic occurrence, and can sometimes be found fused into a single gene encoding a combined polypeptide.

STRING integrates this information from as many genomes as possible to predict functional links between proteins.

Berend Snel and Martijn Huynen (RUN) and the group of Peer Bork (EMBL, Heidelberg)


STRING - predicted functional associations among genes/proteins

STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources:

1. Genomic Context (Synteny)
2. High-throughput Experiments
3. (Conserved) Co-expression
4. Previous Knowledge

STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently contains 736429 proteins in 179 species
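The slide does not say how the evidence from the different sources is combined. As an illustration only, the sketch below assumes that each source yields a confidence between 0 and 1 and that the sources are independent, and merges them with a "noisy-OR" rule: the combined confidence is one minus the probability that every source is wrong. This is a stated assumption for the example, not a description of STRING's actual scoring scheme.

```python
def combine_scores(scores: list) -> float:
    """Combine per-source confidences in [0, 1], assuming independence."""
    probability_all_wrong = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        probability_all_wrong *= 1.0 - score
    return 1.0 - probability_all_wrong


# Hypothetical confidences for one protein pair.
evidence = {"genomic context": 0.5, "co-expression": 0.5}
print(combine_scores(list(evidence.values())))
```

Under this rule, two moderately confident and independent sources reinforce each other, which matches the intuition that links supported by several methods are more reliable.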


STRING - predicted functional associations among genes/proteins

Conserved Neighborhood

This view shows runs of genes that occur repeatedly in close neighborhood in (prokaryotic) genomes. Genes located together in a run are linked with a black line (maximum allowed intergenic distance is 300 bp). Note that if there are multiple runs for a given species, these are separated by white space. If there are other genes in the run that are below the current score threshold, they are drawn as small white triangles. Gene fusion occurrences are also drawn, but only if they are present in a run (see also the Fusion section below for more details).


Functional inference at systems level

Function prediction of individual genes could be made in the context of biological pathways/networks

Example – phoB is predicted to be a transcription regulator and it regulates all the genes in the pho-regulon (a group of co-regulated operons); and within this regulon, gene A is interacting with gene B, etc.


Functional inference at systems level

KEGG is a database of biological pathways and networks


Functional inference at systems level

By homology searching, one can map a known biological pathway from one organism onto another, and hence predict gene functions in the context of biological pathways/networks
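A minimal sketch of this mapping step, assuming that a homology search has already produced a best-hit table from reference genes to target genes. All names here (`map_pathway`, the step labels and the gene identifiers) are hypothetical.

```python
def map_pathway(pathway_steps: dict, best_hits: dict) -> tuple:
    """Transfer a reference pathway to a target organism via best hits.

    pathway_steps: step name -> gene in the reference organism
    best_hits:     reference gene -> homologous gene in the target organism
    Returns (mapped steps, list of steps without a homolog).
    """
    mapped, missing = {}, []
    for step, reference_gene in pathway_steps.items():
        target_gene = best_hits.get(reference_gene)
        if target_gene is None:
            missing.append(step)
        else:
            mapped[step] = target_gene
    return mapped, missing


# Toy data (illustrative only).
pathway_steps = {"step 1": "refA", "step 2": "refB", "step 3": "refC"}
best_hits = {"refA": "target_0421", "refC": "target_1177"}

mapped, missing = map_pathway(pathway_steps, best_hits)
print(mapped)
print(missing)
```

Steps left in `missing` are the interesting cases: either the target organism lacks that step, or a non-homologous protein performs it, which is where methods such as phylogenetic profiles can help.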


Wrapping up

We have seen a number of ways to infer a putative function for a protein sequence

To gain confidence, it is important to combine as many different prediction protocols as possible (the STRING server is an example of this)


Homework

Give an example of two proteins that have the same structural fold but different biological functions, by searching SCOP and Swiss-Prot

What is the biological function of phoR in the two-component system of prokaryotic organisms, based on a KEGG database search?