Protein Modules

An Introduction to Bioinformatics

AIMS

To introduce the concept of multidomain proteins

OBJECTIVES

To define the terms associated with analysis of multidomain proteins

To introduce the major secondary databases

To select an appropriate secondary database for analysis of protein domains

To carry out an analysis to establish the domain structure of a protein

To ascribe likely biological functions to protein domains

When the amino acid sequences of two proteins are compared and found to exhibit significant similarity, they are assumed to be evolutionarily related, i.e. they are homologues

There are two classes of homologue: orthologues and paralogues

Orthologous genes are descended from a single ancestral gene, and their divergence from the corresponding genes in other organisms simply parallels speciation

Paralogous genes are descended from copies of a gene that duplicated within a single ancestral genome

http://www.library.csi.cuny.edu/~davis/Bioinfo_326/lectures/lect5_6/orthologs3.gif

A substantial proportion of all proteins are composed of more than one domain

A domain is defined as sequentially consecutive residues in a protein that can fold up independently of other parts of the protein

Crystallographers commonly refer to domains as folds; the term module is also used

The domain/module is the fundamental unit of protein structure

Inter-domain splicing, fusion, deletion, duplication and shuffling have occurred frequently during evolution, whereas intra-domain rearrangements have occurred rarely

Influenza virus haemagglutinin

When two homologous proteins are aligned, there are usually one or more regions where sequence identity is particularly high; these regions frequently enable the definition of motifs or signature sequences that are diagnostic (Module 4)

Any particular domain may have one or more characteristic motifs
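The observation that conserved regions stand out within an alignment can be illustrated with a short Python sketch. It assumes the two sequences are already aligned (equal length, gaps written as "-") and reports the fraction of identical positions in each sliding window. The function name windowed_identity and the default window size are hypothetical choices for illustration, not part of any standard tool.

```python
def windowed_identity(aln_a, aln_b, window=10):
    """Fraction of identical, non-gap columns in each window of a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both residues agree and neither is a gap.
    matches = [a == b and a != "-" for a, b in zip(aln_a, aln_b)]
    # One score per window start position; empty if the alignment is shorter than the window.
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


if __name__ == "__main__":
    # Toy alignment, invented for this example.
    a = "MKTAYIAKQR-GHSLW"
    b = "MRSAYIAKQRCGTTLW"
    for start, score in enumerate(windowed_identity(a, b, window=6), start=1):
        print(start, round(score, 2))
```

Windows with scores close to 1.0 are candidate regions from which a motif or signature sequence might be derived; in practice such regions are usually defined from multiple alignments rather than a single pair.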

Domains/modules and motifs/signature sequences constitute the content of many secondary databases and are of enormous value in attempting to predict the function and structure of new proteins

Low complexity regions

The individual domains of multidomain proteins are frequently separated from each other by regions of low complexity, also referred to as linker sequences

Long stretches of repeated residues, particularly proline, glutamine, serine or threonine, often indicate linker sequences

The program SEG detects such low complexity regions and can be used as part of BLAST to mask off segments of the query sequence that have low compositional complexity

This leaves the biologically interesting regions of the query sequence available for matching against database sequences
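The idea behind low-complexity masking can be sketched in Python. This is not the SEG algorithm itself; it is a simplified stand-in that measures Shannon entropy in a sliding window and replaces low-entropy windows with "X". The function names, the window length of 12 and the threshold of 2.2 bits are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Mask every window whose compositional entropy falls below the threshold."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            # Replace the whole window; overlapping windows may extend the mask.
            masked[i:i + window] = mask_char * window
    return "".join(masked)


if __name__ == "__main__":
    # Invented sequence: a varied region followed by a glutamine-rich stretch.
    query = "ACDEFGHIKLMN" + "Q" * 12
    print(mask_low_complexity(query))
```

Because whole windows are masked, a few residues flanking the repeat can be masked as well; a real tool refines the boundaries of the masked segment.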

Secondary (pattern) databases

Analysis of the primary protein sequence databases, usually through multiple sequence alignments, has led to the identification of sequence patterns (motifs, signatures, blocks, profiles) common to homologous proteins or protein modules

These motifs, usually ~10-20 amino acids in length, commonly correspond to key functional or structural elements, often domains/modules, and are extremely useful in identifying such features in new uncharacterized proteins

An unknown protein is often too distantly related to any protein of known sequence to detect its resemblance by overall sequence alignment, but it can potentially be identified by the occurrence in its sequence of a particular motif

A number of programs allow an unknown protein to be searched against databases of motifs, profiles and related patterns

Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families, which is based on proteins from both SWISS-PROT and SP-TrEMBL

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures

PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs
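Searching a sequence for a PROSITE-style pattern can be sketched in Python by translating the pattern into a regular expression. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P}. The function names prosite_to_regex and find_motif are hypothetical, and the translation covers only the basic syntax (x, [..], {..}, repeat counts and the < and > anchors); it is an illustration, not a replacement for the PROSITE scanning tools.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")              # patterns may end with a period
    return (pattern.replace("-", "")           # '-' only separates elements
            .replace("{", "[^").replace("}", "]")   # {P}     -> any residue except P
            .replace("(", "{").replace(")", "}")    # x(2,4)  -> repeat count
            .replace("x", ".")                      # x       -> any residue
            .replace("<", "^").replace(">", "$"))   # N- and C-terminal anchors


def find_motif(seq, pattern):
    """Return (1-based start, matched text) for every match, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(seq)]


if __name__ == "__main__":
    # Invented sequence: one site matches, the second N is followed by P and is rejected.
    print(find_motif("MKNGSAANPSL", "N-{P}-[ST]-{P}"))
```

A short pattern like this one matches many sequences by chance, so a hit suggests a possible site rather than demonstrating it.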
