protein modules

Upload: truong

Post on 15-Jan-2016

Protein Modules

An Introduction to Bioinformatics

AIMS

To introduce the concept of multidomain proteins

To define the terms associated with analysis of multidomain proteins

To introduce the major secondary databases

OBJECTIVES

To select an appropriate secondary database for analysis of protein domains

To carry out an analysis to establish the domain structure of a protein

To ascribe likely biological functions to protein domains

When the amino acid sequences of two proteins are compared and found to exhibit significant similarity, they are assumed to be evolutionarily related, i.e. they are homologues

There are two classes of homologue: orthologues and paralogues

Orthologous genes are descended from a single ancestral gene, and their divergence in different organisms simply parallels speciation

Paralogous genes are descended from copies of a gene that duplicated within a single ancestral genome

A substantial proportion of all proteins are composed of more than one domain

A domain is defined as sequentially consecutive residues in a protein that can fold up independently of other parts of the protein

Crystallographers commonly refer to domains as folds and the term module is also used

The domain/module is the fundamental unit of protein structure

Inter-domain splicing, fusion, deletion, duplication and shuffling have occurred frequently during evolution, whereas intra-domain rearrangements have occurred rarely

Influenza virus haemagglutinin

When two homologous proteins are aligned, there are one or more regions where sequence identity is particularly high, and these regions frequently enable the definition of motifs or signature sequences that are diagnostic (Module 4)
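The idea of locating regions of particularly high identity in a pairwise alignment can be sketched as a sliding-window calculation. This is only an illustration: the function names, the window size and the threshold are assumptions chosen for the example, not values taken from the slides, and the two inputs are assumed to be already aligned strings of equal length with '-' for gaps.

```python
def window_identity(aln_a, aln_b, window=10):
    """Return (start, fraction_identical) for every window over a pairwise alignment.

    aln_a and aln_b are equal-length aligned strings; '-' marks a gap.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where the column holds an identical, non-gap pair, else 0
    matches = [int(a == b and a != "-") for a, b in zip(aln_a, aln_b)]
    scores = []
    for start in range(len(matches) - window + 1):
        scores.append((start, sum(matches[start:start + window]) / window))
    return scores


def conserved_regions(aln_a, aln_b, window=10, threshold=0.8):
    """Merge windows scoring at or above threshold into (start, end) column ranges.

    end is exclusive. window and threshold are illustrative assumptions.
    """
    regions = []
    for start, frac in window_identity(aln_a, aln_b, window):
        if frac < threshold:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            # overlaps or touches the previous region: extend it
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

The column ranges returned by conserved_regions are candidates for motif definition; in practice motifs are derived from multiple alignments of many family members rather than from a single pair.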

Any particular domain may have one or more characteristic motifs

Domains/modules and motifs/signature sequences constitute the content of many secondary databases and are of enormous value in attempting to predict the function and structure of new proteins

Low complexity regions

The individual domains of multidomain proteins are frequently separated from each other by regions of low complexity, also referred to as linker sequences

Long stretches of repeated residues, particularly proline, glutamine, serine or threonine, often indicate linker sequences
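A quick way to flag candidate linker sequences of this kind is a regular expression over the four residues named above. This is a minimal sketch: the function name and the minimum run length of 8 are assumptions made for illustration, and the pattern accepts any mixture of P, Q, S and T rather than only a single repeated residue.

```python
import re

LINKER_RESIDUES = "PQST"  # proline, glutamine, serine, threonine


def find_linker_runs(seq, min_len=8, residues=LINKER_RESIDUES):
    """Return (start, end, run) for each stretch of linker-type residues.

    start and end are 0-based, end exclusive. min_len=8 is an arbitrary
    illustrative cut-off, not a published value.
    """
    pattern = re.compile("[%s]{%d,}" % (residues, min_len))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(seq.upper())]
```

Hits from such a scan only suggest where domain boundaries might lie; they should be checked against a domain database search.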

The program SEG detects such low complexity regions and can be used as part of BLAST to mask off segments of the query sequence that have low compositional complexity

This leaves the biologically interesting regions of the query sequence available for matching against database sequences
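The masking idea can be illustrated with a simple sliding-window Shannon entropy calculation. This is not the SEG algorithm itself, only a simplified stand-in for the same concept; the function names, the window length of 12 and the 2.2-bit threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace residues in low-entropy windows with mask_char.

    Simplified illustration only: every window whose entropy is at or below
    threshold is masked in full. window and threshold are assumed values.
    """
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

Because whole windows are masked, a few residues flanking a low complexity stretch may be masked as well. The masked sequence can then be used as a database query so that compositionally biased stretches do not generate spurious matches.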

Secondary (pattern) databases

Analysis of the primary protein sequence databases, usually through multiple sequence alignments, has led to the identification of sequence patterns (motifs, signatures, blocks, profiles) common to homologous proteins or protein modules

These motifs, usually ~10-20 amino acids in length, commonly correspond to key functional or structural elements, often domains/modules, and are extremely useful in identifying such features in new uncharacterized proteins

An unknown protein is often too distantly related to any protein of known sequence to detect its resemblance by overall sequence alignment, but it can potentially be identified by the occurrence in its sequence of a particular motif
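Why a motif can be found when overall alignment fails can be shown with a small position-specific scoring sketch. It builds log-odds scores from a set of aligned, ungapped motif instances and slides them along a query. The function names, the uniform background frequency and the pseudocount are assumptions made for illustration; real secondary databases use more elaborate models.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(motif_instances, pseudocount=1.0):
    """Build per-position log-odds scores from equal-length, ungapped motif instances."""
    length = len(motif_instances[0])
    if any(len(m) != length for m in motif_instances):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for pos in range(length):
        column = [m[pos] for m in motif_instances]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


def best_hit(profile, seq):
    """Return (start, score) of the highest-scoring window in seq, or None if seq is too short."""
    width = len(profile)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # residues outside the 20 standard amino acids score 0 (assumption)
        score = sum(profile[i].get(aa, 0.0) for i, aa in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

Only the short motif window contributes to the score, so the rest of the query can diverge freely from every known family member without hiding the match.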

A number of programs allow an unknown protein sequence to be searched against databases of motifs, profiles and related patterns

Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families, which is based on proteins from both SWISS-PROT and SP-TrEMBL

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures

PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs
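PROSITE patterns are written in a compact syntax that maps closely onto regular expressions, so a single pattern can be tried against a sequence with a few lines of code. The sketch below handles the common elements (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and < or > as terminal anchors at the ends of the pattern); the function name is hypothetical and less common constructs, such as an anchor inside square brackets, are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # split off an optional repeat count: x(2) or x(2,4)
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError("empty or malformed pattern element")
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        # [ABC] and single letters are already valid regex syntax
        repeat = ""
        if lo is not None:
            repeat = "{%s}" % lo if hi is None else "{%s,%s}" % (lo, hi)
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(pattern, seq):
    """Return 0-based start positions of non-overlapping matches of a PROSITE-style pattern."""
    return [m.start() for m in re.finditer(prosite_to_regex(pattern), seq.upper())]
```

For example, the N-glycosylation site pattern N-{P}-[ST]-{P} (PROSITE entry PS00001) becomes the regular expression N[^P][ST][^P]. Short patterns like this match by chance in many sequences, so a hit indicates a possible site rather than a confirmed one.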