essential info notes-1

Upload: jayanthbumaiyya

Post on 14-Apr-2018

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 157

Biological Databases: Why?

There are two main functions of biological databases:

Making biological data available to scientists: As much information as possible should be available in one single place (book, site, database). Public data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article.

Making biological data available in computer-readable form: Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than in print or on paper) is a necessary first step.

One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s.

Computers became the storage medium of choice as soon as they came within the reach of ordinary scientists. Databases were distributed on tapes, and later on various kinds of discs. Once universities and research institutions were connected to the Internet or its precursors (national computer networks), it is easy to understand why the network became the medium of choice. It is even easier to see why the World Wide Web (WWW), based on HTTP (Hypertext Transfer Protocol), has been the standard method of communication and access for nearly all biological databases since the beginning of the 1990s.

As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are nucleotide sequences, protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR. A new field of science dealing with the issues, challenges and new possibilities created by these databases has emerged: bioinformatics.

Other types of data that are, or will soon be, available in databases are metabolic pathways (KEGG), gene expression data (microarrays), protein-protein interactions and other types of data related to biological function and processes.

Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps in the fight against diseases, assists in the development of medications and helps in discovering basic relationships among species in the history of life.

Biological knowledge is distributed among many different general and specialized databases. This sometimes makes it difficult to ensure the consistency of information. Biological databases cross-reference other databases with accession numbers as one way of linking their related knowledge together.

An important resource for finding biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available and categorizes many of the publicly available online databases related to biology and bioinformatics.

Most important public databases for molecular biology

Primary Sequence DBs (collaborative project)

DDBJ (DNA DataBase of Japan)

EMBL Nucleotide DB (European Molecular Biology Laboratory)

GenBank (National Center for Biotechnology Information)

Meta-DBs

Entrez Gene Unified retrieval of gene-centred information (NCBI)

euGenes Assembled information on eukaryotic genomes (Univ of Indiana)

GeneCards (Weizmann Inst)

GenLoc UDB (Weizmann Inst)

SOURCE (Univ of Stanford)

LocusLink (National Center for Biotechnology Information)

Genome Annotation Systems

Ensembl Genome Browser Automatically Annotated Genomes (EMBL-EBI and Wellcome Trust Sanger Inst)

UniGene Automatic partitioning of GenBank sequences (NCBI)

Golden Path UCSC (Univ of California Santa Cruz)

Specialized DBs

CGAP Cancer Genes (National Cancer Institute)

Clone Registry Clone Collections (National Center for Biotechnology Information)

IMAGE Clone Collections (Image Consortium)

DBGET H. sapiens retrieval system (Univ of Kyoto)

DIP Interacting Proteins (Univ of California)

GDB (Human Genome Organization)

KEGG Functional Db (Univ of Kyoto)

MGI Mouse Genome (Jackson Lab)

OMIM Inherited Diseases (National Center for Biotechnology Information)

SWISS-PROT Protein Db (Swiss Institute of Bioinformatics)

PEDANT Protein Db (Forschungszentrum f. Umwelt & Gesundheit)

List with SNP-Databases

Reactome The Genome Knowledgebase (EBI)

Microarray-DBs

ArrayExpress (European Bioinformatics Institute)

Gene Expression Omnibus (National Center for Biotechnology Information)

maxd (Univ of Manchester)

SMD (Univ of Stanford)

Accession codes vs. identifiers

Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways. Basically, it has two names:

Identifier

Accession code (or number)

How to deal with changed, updated and deleted entries in databases is a very tricky problem, and the policies for how accession codes and identifiers are changed or kept constant are not completely consistent between databases, or even over time for one single database.

The exact definition of what the identifier and accession code are supposed to denote varies between the different databases, but the basic idea is the following.

Identifier

An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.

SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.

An identifier is usually allowed to change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this does not happen very often. In fact, it happens so rarely that it's not really a big problem.
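The two-part structure of a SWISS-PROT entry name can be illustrated with a small sketch. The function name `split_entry_name` is hypothetical and the underscore-separated layout is taken from the KRAF_HUMAN example above; real databases may apply additional naming rules that are not modelled here.

```python
from typing import Tuple


def split_entry_name(entry_name: str) -> Tuple[str, str]:
    """Split an entry name of the form PROTEIN_SPECIES into its two parts."""
    protein, separator, species = entry_name.partition("_")
    # Reject names that lack the separator or have an empty half.
    if not separator or not protein or not species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return protein, species


print(split_entry_name("KRAF_HUMAN"))  # ('KRAF', 'HUMAN')
```

Because identifiers may change, a sketch like this is only suitable for display purposes; stable references should use the accession code.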

Accession code (number)

An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.

The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.

When two entries are merged into a single one, the new entry will have both accession codes, where one will be the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also have the old accession code as a secondary code.
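The merge and split rules can be sketched as a toy registry. This is a minimal illustration, not how any real database is implemented: the class `AccessionIndex`, its methods and the accession codes X00001 to X00004 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    primary: str                                          # current, citable accession code
    secondary: List[str] = field(default_factory=list)    # codes inherited from older entries


class AccessionIndex:
    """Toy registry keyed by primary accession code (hypothetical, not a real database API)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def add(self, primary: str) -> None:
        self._entries[primary] = Entry(primary)

    def merge(self, keep: str, absorb: str) -> None:
        """Merge two entries: `keep` stays primary, `absorb` becomes secondary."""
        gone = self._entries.pop(absorb)
        self._entries[keep].secondary += [gone.primary] + gone.secondary

    def split(self, old: str, new_a: str, new_b: str) -> None:
        """Split one entry: both new entries carry the old codes as secondary."""
        gone = self._entries.pop(old)
        inherited = [gone.primary] + gone.secondary
        for accession in (new_a, new_b):
            self._entries[accession] = Entry(accession, list(inherited))

    def resolve(self, accession: str) -> List[str]:
        """Return the primary codes of all current entries reachable from `accession`."""
        return sorted(
            entry.primary
            for entry in self._entries.values()
            if accession == entry.primary or accession in entry.secondary
        )


index = AccessionIndex()
index.add("X00001")
index.add("X00002")
index.merge(keep="X00001", absorb="X00002")
print(index.resolve("X00002"))  # ['X00001']
index.split("X00001", "X00003", "X00004")
print(index.resolve("X00002"))  # ['X00003', 'X00004']
```

The point of the design is that an old accession code never becomes a dead reference: after a merge it resolves to the surviving entry, and after a split it resolves to both descendants.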

NUCLEOTIDE DATABASES

NCBI's sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.

GenBank An annotated collection of all publicly available nucleotide and amino acid

sequences

EST database A collection of expressed sequence tags or short single-pass

sequence reads from mRNA (cDNA)

GSS database A database of genome survey sequences or short single-pass

genomic sequences

HomoloGene A gene homology tool that compares nucleotide sequences between

pairs of organisms in order to identify putative orthologs

HTG database A collection of high-throughput genome sequences from large-scale

genome sequencing centers including unfinished and finished sequences

SNPs database A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms

RefSeq A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support the data-gathering efforts

STS database A database of sequence tagged sites or short sequences that are

operationally unique in the genome

UniSTS A unified non-redundant view of sequence tagged sites (STSs)

UniGene A collection of ESTs and full-length mRNA sequences organized into

clusters each representing a unique known or putative human gene annotated with

mapping and expression information and cross-references to other sources

DNA & RNA Databases

Major Sequence Repositories – Human Chromosome Information – Organelle Genome Databases – RNA Databases – Comparative & Phylogenetic Databases – SNPs, Mutations and Variations Databases – Alternative Splicing Databases – Specialized Databases

Major Sequence Repositories

DDBJ DNA databank of Japan

EMBL Maintained by EMBL

GenBank Maintained by NCBI

Human Chromosome Information

Chromosome information is listed separately for chromosomes 1 to 22, X and Y.

Organelle Genome Databases

OGMP Organelle genome megasequencing program

GOBASE An organelle genome database

MitoMap Human mitochondrial genome database

RNA Databases

Rfam RNA family database

RNA base Database of RNA structures

tRNA database Database of tRNAs

tRNA tRNA sequences and genes

sRNA Small RNA database

Comparative & Phylogenetic Databases

COG Phylogenetic classification of proteins

DHMHD Human-mouse homology database

HomoloGene Gene homologies across species

Homophila Human disease to Drosophila gene database

HOVERGEN Database of homologous vertebrate genes

TreeBase A database of phylogenetic knowledge

XREF Cross-referencing with model organisms

SNPs, Mutations & Variations Databases

ALPSbase Database of mutations causing human ALPS

dbSNP Single nucleotide polymorphism database at NCBI

HGVbase Human Genome Variation database

Alternative Splicing Databases

ASAP Alternate splicing analysis tool at UCLA

ASG Alternate splicing gallery

HASDB Human alternative splicing database at UCLA

AsMamDB Alternatively spliced genes in human, mouse and rat

ASD Alternative splicing database at CSHL

Specialised Databases

ABIM Links to several genomics databases

ACUTS Ancient conserved untranslated sequences

AGSD Animal genome size database

AmiGO The Gene Ontology database

ARGH The acronym database

ASDB Database of alternatively spliced genes

BACPAC BAC and PAC genomic DNA library info

BBID Biological Biochemical image database

Cardiac gene database

CHLC Genetic markers on chromosomes

COGENT Complete genome tracking database

COMPEL Composite regulatory elements in eukaryotes

CUTG Codon usage database

dbEST Database of expressed sequences or mRNA

dbGSS Genome survey sequence database

dbSTS Sequence tagged sites (STS)

DBTSS Database of transcriptional start sites

DOGS Database of genome sizes

EID The exon-intron database – Harvard

Exon-Intron Exon-Intron database – Singapore

EPD Eukaryotic promoter database

FlyTrap HTML based gene expression database

GDB The genome database

GenLink Resources for human genetic and telomere research

GeneKnockouts Gene knockout information

GENOTK Human cDNA database

GEO Gene expression omnibus NCBI

GOLD Information on genome projects around the world

GSDB The Genome Sequence DataBase

HGI TIGR human gene index

HTGS High-throughput genomic sequence at NCBI

IMAGE The largest collection of DNA sequence clones

IMGT The international ImMunoGeneTics information system

IPCN Index to Plant Chromosome Numbers database

LocusLink Single query interface to sequence and genetic loci

TelDB The telomere database

MitoDat Mitochondrial nuclear genes

Mouse EST NIA mouse cDNA project

MPSS Searchable databases of several species

NDB Nucleic acid database

NEDO Human cDNA sequence database

NPD Nuclear protein database

Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute

PLACE Database of plant cis-acting regulatory DNA elements

RDP Ribosomal database project

RDB Receptor database at NIHS Japan

Refseq The NCBI reference sequence project

RHdb Radiation hybrid physical map of chromosomes

SHIGAN SHared Information of GENetic resources Japan

SpliceDB Canonical and non-canonical splice site sequences

STACK Consensus human EST database

TAED The adaptive evolution database

TIGR Curated databases of microbes plants and humans

TRANSFAC The Transcription Factor Database

TRRD Transcription Regulatory region database

UniGene Cluster of sequences for unique genes at NCBI

UniSTS Nonredundant collection of STS

Protein Databases

Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others

Protein Sequence Databases

Antibodies Sequence and Structure

BRENDA Enzyme database

CD Antigens Database of CD antigens

dbCFC Cytokine family database

Histones Histone sequence database

HPRD Human protein reference database

InterPro Integrated documentation resources for protein families

iProClass An integrated protein classification database

KIND A non-redundant protein sequence database

MHCPEP Database of MHC binding peptides

MIPS Munich information centre for protein sequences

PIR Annotated and non-redundant protein sequence database

PIR-ALN Curated database of protein sequence alignments

PIR-NREF PIR nonredundant reference protein database

PMD Protein mutant database

PRF Protein research foundation Japan

ProClass Non-redundant protein database

ProtoMap Hierarchical classification of SwissProt proteins

REBASE Restriction enzyme database

RefSeq Reference sequence database at NCBI

SwissProt Curated protein sequence database

SPTR Comprehensive protein sequence database

Transfac Transcription factor database

TrEMBL Annotated translations of EMBL nucleotide sequences

Tumor gene database Genes with cancer-causing mutations

WD repeats WD-repeat family of proteins

Protein Structure Databases

Cath Protein structure classification

HIV Protease HIV protease database 3D structure

PDB 3-D macromolecular structure data

PSI Protein structure initiative

S2F Structure to function project

Scop Structural Classification of Proteins

Protein Domains, Motifs & Signatures

BLOCKS Multiple aligned segments of conserved protein regions

CDD Conserved domain database and search service

DOMO Homologous protein domain families

Pfam Database of protein domains and HMMs

ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia

Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – Cyanobacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the WWW

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC

ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germplasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated Mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)

Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyanobacteria (blue-green algae)

Cyanobacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)

Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics

Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey

Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain often also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.

The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
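The idea of scanning a query with a position-specific scoring matrix can be sketched in a few lines. This is a deliberately simplified, ungapped sliding-window scan and not RPS-BLAST itself, which adds gapped alignment and statistical significance estimates. The three-column matrix `TOY_PSSM`, the default score and the threshold are made-up values for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical 3-column PSSM: one dict of residue -> score per alignment column.
TOY_PSSM: List[Dict[str, int]] = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 5, "T": 4},
]
DEFAULT_SCORE = -2  # assumed penalty for residues not listed in a column


def score_window(window: str, pssm: List[Dict[str, int]]) -> int:
    """Sum the column-specific scores for each residue in the window."""
    return sum(column.get(residue, DEFAULT_SCORE) for residue, column in zip(window, pssm))


def scan(sequence: str, pssm: List[Dict[str, int]], threshold: int) -> List[Tuple[int, int]]:
    """Return (start, score) for every window scoring at or above the threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(sequence[start:start + width], pssm)
        if score >= threshold:
            hits.append((start, score))
    return hits


print(scan("MAGKSLLGRTP", TOY_PSSM, threshold=10))  # [(2, 14), (7, 12)]
```

The same residue scores differently depending on the column it falls in, which is what distinguishes a position-specific matrix from a general substitution matrix.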

CDART: Domain Architectures

The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
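Grouping and scoring by architecture can be illustrated with a toy example. This sketch treats an architecture as an ordered tuple of domain names and ranks other architectures by the number of domains they share with the query; the real CDART scoring scheme is not described in these notes, so the scoring rule here is an assumption. The protein names and domain annotations are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Architecture = Tuple[str, ...]  # sequential order of conserved domains

# Hypothetical protein -> ordered domain annotation.
PROTEINS: Dict[str, Architecture] = {
    "protA": ("SH3", "SH2", "Kinase"),
    "protB": ("SH3", "SH2", "Kinase"),
    "protC": ("SH2", "Kinase"),
    "protD": ("PH",),
}


def group_by_architecture(proteins: Dict[str, Architecture]) -> Dict[Architecture, List[str]]:
    """Collect proteins that share exactly the same ordered domain list."""
    groups: Dict[Architecture, List[str]] = defaultdict(list)
    for name, architecture in proteins.items():
        groups[architecture].append(name)
    return dict(groups)


def rank_architectures(query: Architecture, proteins: Dict[str, Architecture]):
    """Rank architecture groups by the number of domains shared with the query."""
    ranked = []
    for architecture, members in group_by_architecture(proteins).items():
        shared = len(set(query) & set(architecture))  # assumed score: shared domain count
        if shared:
            ranked.append((shared, architecture, sorted(members)))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked


for shared, architecture, members in rank_architectures(("SH3", "SH2", "Kinase"), PROTEINS):
    print(shared, architecture, members)
```

Comparing domain lists rather than residues is what lets this kind of search relate proteins whose sequences have diverged too far for direct alignment.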

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at the United States National Center for Biotechnology Information, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries.

CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14,831 families)

Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions.

Domain: A structural unit.

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Motif: A short unit found outside globular domains.

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches.

View a Pfam family: View Pfam family annotation and alignments.

View a clan: See groups of related families.

View a sequence: Look at the domain organisation of a protein sequence.

View a structure: Find the domains on a PDB structure.

Keyword search: Query Pfam by keywords.

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
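As a rough illustration of how the two GA cutoffs could be applied, the sketch below keeps a hit only when both its sequence score and its domain score reach the respective cutoff. The `Hit` class, the function name and all numbers are hypothetical; this is an assumed reading of the definition above, not Pfam's own code.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A hypothetical search hit; the field names are illustrative only."""
    sequence_id: str
    sequence_score: float  # bit score of the whole sequence against the HMM
    domain_score: float    # bit score of one domain match on that sequence


def passes_gathering_threshold(hit: Hit, ga_sequence: float, ga_domain: float) -> bool:
    """Assumed rule: the hit must reach both the sequence and the domain cutoff."""
    return hit.sequence_score >= ga_sequence and hit.domain_score >= ga_domain


# Made-up scores and cutoffs, for illustration only.
hits = [
    Hit("seqA", 45.2, 30.1),
    Hit("seqB", 19.8, 19.8),
    Hit("seqC", 27.0, 12.5),
]
full_alignment = [
    h.sequence_id for h in hits
    if passes_gathering_threshold(h, ga_sequence=25.0, ga_domain=20.0)
]
print(full_alignment)
```

Here seqC reaches the sequence cutoff but not the domain cutoff, so only seqA would enter the full alignment under this reading.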

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit score of the highest scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest scoring match in the full alignment.
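The trusted and noise cutoffs follow directly from the gathering threshold and the list of match scores. A minimal sketch, assuming a single list of bit scores and a single threshold (the function name and the numbers are made up for illustration):

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Return (TC, NC) for a list of bit scores and one gathering threshold.

    TC is the lowest score that is still in the full alignment (>= GA);
    NC is the highest score that is left out (< GA).
    Either value is None when the corresponding group is empty.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    trusted = min(included) if included else None
    noise = max(excluded) if excluded else None
    return trusted, noise


tc, nc = trusted_and_noise_cutoffs([45.2, 31.0, 26.4, 24.1, 10.3], gathering_threshold=25.0)
print(tc, nc)
```

With these example scores the gathering threshold of 25.0 sits between the two values returned, which is the relationship the three definitions imply.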

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are exploring a limited set of genomes.

Different color schemes are used to easily identify the mode you are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
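The definition above is a base-2 log-likelihood ratio, which can be written in a couple of lines. This is only a sketch of the arithmetic; the function name and the likelihood values are made up for illustration.

```python
import math


def bits_score(likelihood_homologue: float, likelihood_random: float) -> float:
    """Base-2 logarithm of the ratio between the two model likelihoods."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homology model scores 3 bits.
print(bits_score(8.0, 1.0))
# Equal likelihoods give 0 bits; a more likely random model gives a negative score.
print(bits_score(0.5, 0.5))
print(bits_score(1.0, 4.0))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.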

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query, in the same order. (Additional domains are allowed.)
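The difference between the two definitions can be sketched as a set test versus an ordered-subsequence test. The function names are hypothetical, the domain lists are invented examples, and reading "same order, additional domains allowed" as a subsequence match is an assumption, not SMART's documented implementation.

```python
def same_domain_composition(query, protein):
    """True if the protein has at least one copy of each query domain."""
    return set(query) <= set(protein)


def same_domain_organisation(query, protein):
    """True if the query domains occur in the protein in the same order.

    Extra domains may sit before, between or after them (assumed reading).
    The shared iterator is consumed left to right, so order is enforced.
    """
    remaining = iter(protein)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
in_order = ["SH3", "PH", "SH2", "TyrKc"]     # extra PH domain, order preserved
shuffled = ["SH2", "PH", "SH3", "TyrKc"]     # same domains, different order

print(same_domain_composition(query, in_order), same_domain_organisation(query, in_order))
print(same_domain_composition(query, shuffled), same_domain_organisation(query, shuffled))
```

Every protein that matches the organisation test also matches the composition test, but not the other way round.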

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
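A consensus line of this kind can be derived from per-position emission probabilities. The sketch below assumes the model is given as a list of dictionaries mapping residues to probabilities, which is a simplification made for illustration; the probabilities are invented and the function is not part of HMMer.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    For each position the most probable residue is shown, in upper case when
    its probability exceeds the threshold and in lower case otherwise.
    """
    consensus = []
    for position in emissions:
        residue, probability = max(position.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(consensus)


# Three made-up match states of a toy protein model.
emissions = [
    {"G": 0.90, "A": 0.10},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.55, "R": 0.45},
]
print(hmm_consensus(emissions))
```

The second position is written in lower case because even its most likely residue stays below the 0.5 threshold.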

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication.

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm.

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame.

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB: protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
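The selection rule in this procedure is a simple counting step. The sketch below assumes the Medline neighbours have already been retrieved and are given as a dictionary from each hand-selected paper to its neighbour list; the function name and the paper identifiers are invented for illustration.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbours linked to more than two of the original papers.

    neighbours_by_paper maps each hand-selected paper to the list of its
    neighbouring papers. "More than two" is expressed as min_sources=3.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Use a set so one original paper counts at most once per neighbour.
        counts.update(set(neighbours))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


# Toy input: three original papers with short neighbour lists.
neighbours_by_paper = {
    "P1": ["N1", "N2"],
    "P2": ["N1", "N3"],
    "P3": ["N1", "N2"],
}
print(secondary_literature(neighbours_by_paper))
```

In this toy input only N1 is shared by three original papers, so it is the only entry that would reach the secondary literature list.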

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile), or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
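The representative-selection step of the procedure above can be sketched as follows. The clustering itself (done with CD-HIT in SMART) is taken as given here: clusters are supplied as lists of protein IDs, with unclustered single proteins as one-member lists. The function name, the IDs and the sequences are hypothetical.

```python
def choose_representatives(clusters, sequences):
    """Pick the longest member of each cluster as its representative.

    clusters  -- list of lists of protein IDs (singletons are one-member lists)
    sequences -- dictionary mapping protein ID to amino acid sequence
    Ties are resolved in favour of the member listed first.
    """
    return [max(cluster, key=lambda pid: len(sequences[pid])) for cluster in clusters]


# Made-up sequences for one species.
sequences = {
    "prot1": "MKTAYIAKQR",
    "prot1_fragment": "MKTAY",
    "prot2": "MSDNELLQ",
    "prot3": "MAAPLV",
}
clusters = [["prot1_fragment", "prot1"], ["prot2"], ["prot3"]]
print(choose_representatives(clusters, sequences))
```

Only the returned IDs would then be used for architecture queries and domain counts, which is how fragments of the same protein stop inflating the numbers.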

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
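Finding the last common ancestor amounts to taking the longest shared prefix of the organisms' taxonomic lineages. The sketch below assumes each lineage is a list of taxon names ordered from the root downwards; the lineages shown are heavily simplified for illustration and are not actual NCBI taxonomy records.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if there is none.

    Each lineage is a list of taxon names ordered from the root to the species.
    """
    common = []
    for taxa in zip(*lineages):
        if all(taxon == taxa[0] for taxon in taxa):
            common.append(taxa[0])
        else:
            break
    return common[-1] if common else None


# Simplified, illustrative lineages of organisms sharing one domain architecture.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"]

print(last_common_ancestor([human, fly]))
print(last_common_ancestor([human, fly, yeast]))
```

Adding the yeast lineage pushes the predicted point of origin from Metazoa back to Eukaryota, which shows how each additional organism can only move the estimate towards the root.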

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern, standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
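A reciprocal best match between two genomes means that each protein is the other's top hit. The sketch below assumes the best hits in each direction have already been computed and are supplied as dictionaries; the function name and the protein identifiers are invented, and this is not the pipeline SMART itself uses.

```python
def reciprocal_best_matches(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    best_a_to_b -- dictionary: protein in genome A -> its best hit in genome B
    best_b_to_a -- dictionary: protein in genome B -> its best hit in genome A
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Toy best-hit tables for two genomes.
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a3"}
print(reciprocal_best_matches(best_a_to_b, best_b_to_a))
```

Protein a2 also points to b2, but b2 prefers a3, so only the a3/b2 pair is kept as a 1:1 match.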

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes

As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera :-), and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Selective SMART allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
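The matching rule described here (case-insensitive names, and a term repeated n times requiring n copies of the domain) can be sketched in a few lines. This is an illustration of the rule only, not SMART's implementation: the function `matches_query` is hypothetical, it handles AND-only queries, and the protein is represented simply as a list of domain and feature names.

```python
import re
from collections import Counter
from typing import Iterable


def matches_query(query: str, protein_features: Iterable[str]) -> bool:
    """Return True if a protein satisfies an AND-only architecture query.

    Names are compared case-insensitively; a term that appears n times in
    the query requires at least n copies of that domain in the protein.
    """
    terms = re.split(r"\s+AND\s+", query.strip(), flags=re.IGNORECASE)
    required = Counter(t.strip().lower() for t in terms if t.strip())
    if not required:
        return False  # an empty query matches nothing in this sketch
    present = Counter(f.lower() for f in protein_features)
    return all(present[name] >= count for name, count in required.items())


if __name__ == "__main__":
    three_sh3 = ["SH3", "SH3", "SH3", "TyrKc"]   # made-up architecture
    print(matches_query("Sh3 AND sH3 AND sh3", three_sh3))
    print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3"]))
```

Counting terms with `Counter` keeps the multiplicity check to a single comparison per domain name.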

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
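To make the nine-level code concrete, the sketch below splits a CATHsolid identifier into its named levels and reports the deepest level two domains share. It assumes the identifier is written as nine dot-separated integers in the order C.A.T.H.S.O.L.I.D; that written form, the function names and the example codes are illustrative assumptions, not taken from the text.

```python
from typing import Dict

# Level names in hierarchy order: Class, Architecture, Topology,
# Homologous superfamily, then the five SOLID sequence levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dotted nine-part code into a level -> number mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} dot-separated parts, got {len(parts)}")
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def deepest_shared_level(code_a: str, code_b: str) -> str:
    """Return the deepest level at which two codes still agree ('' if none)."""
    a, b = parse_cathsolid(code_a), parse_cathsolid(code_b)
    shared = ""
    for level in LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared


if __name__ == "__main__":
    # Made-up codes: same superfamily and S-level cluster, different O-level.
    print(deepest_shared_level("3.40.50.200.1.1.1.1.1", "3.40.50.200.1.2.1.1.1"))
```

Two domains that agree down to the S-level but not the O-level would, under the clustering described above, fall in the same 35% identity cluster but in different 60% identity clusters.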

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
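The automatic-versus-manual decision can be pictured as a simple threshold check on each inherited result. The sketch below encodes only the four criteria named in the text, reading the RMSD limit as 6.0 Å; the text ends its list with "etc.", so the real protocol applies further criteria that are not modelled here. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InheritedChop:
    """Summary of boundaries inherited from one previously chopped chain."""
    ssap_score: float             # SSAP structural alignment score (0-100)
    sequence_identity: float      # percent identity to the chopped chain
    rmsd: float                   # RMSD of the alignment, in angstroms
    longest_end_extension: int    # residues extending past the alignment ends


def passes_autochop(result: InheritedChop) -> bool:
    """Apply the four thresholds quoted in the text (a partial list)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


if __name__ == "__main__":
    candidate = InheritedChop(ssap_score=88.0, sequence_identity=93.5,
                              rmsd=1.8, longest_end_extension=4)
    route = "AutoChop" if passes_autochop(candidate) else "manual DomChop support"
    print(route)
```

A result that fails any one threshold is not discarded in this picture; as the text describes, it is passed on as supporting information for manual boundary assignment.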

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage (domain boundary assignment, DomChop, and homologue recognition, HomCheck) in the classification, we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues in length.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
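The hit-selection rule in this paragraph amounts to two thresholds applied to every HMM match. The sketch below shows one way to express it, reading the E-value cutoff as 0.01 and the overlap as 60% of the CATH domain length. How overlap is actually measured is not stated in the text, so the definition used here (aligned domain residues divided by domain length) and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HmmHit:
    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the CATH domain covered by the hit
    domain_length: int      # length of the CATH domain, in residues


def is_sequence_relative(hit: HmmHit,
                         max_e_value: float = 0.01,
                         min_overlap: float = 0.60) -> bool:
    """Keep hits below the E-value cutoff that cover enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def select_relatives(hits: Iterable[HmmHit]) -> List[str]:
    """Return the ids of hits that satisfy both thresholds."""
    return [h.sequence_id for h in hits if is_sequence_relative(h)]


if __name__ == "__main__":
    # Made-up hits, for illustration only.
    hits = [
        HmmHit("seqA", 1e-20, 150, 160),   # strong hit, nearly full coverage
        HmmHit("seqB", 1e-5, 60, 160),     # significant, but too short
        HmmHit("seqC", 0.5, 155, 160),     # full coverage, not significant
    ]
    print(select_relatives(hits))
```

The stricter annotation-transfer rule at the end of the paragraph (95% identity, 80% overlap) could be written the same way with different defaults.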

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
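As a rough reading aid, the score bands above can be turned into a small lookup. The text presents them as general tendencies ("generally"), not hard cutoffs, so the function below is a rule-of-thumb sketch with hypothetical names and labels, and a score alone does not establish homology.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the band described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "similar fold (Topology level); may be convergent"
    return "no clear structural relationship"


if __name__ == "__main__":
    for s in (100, 86.5, 74.0, 55.0):
        print(s, "->", interpret_ssap(s))
```

Scores in the 70 to 80 band are where the text's caveat applies: without supporting sequence or functional similarity, a shared fold may reflect convergence rather than common ancestry.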

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in Genbank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped blast and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold "band" therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to "new" sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
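The percentages quoted for the Figure 4 data can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the number of novel folds is not quoted directly, so it is taken here as the remainder of the 443 structures, which is an assumption.

```python
TOTAL_NEW_DOMAINS = 2159        # domains classified June 1997 to March 1998
NEW_SEQUENCE_STRUCTURES = 443   # structures with "new" sequences (Fig. 4b)

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: novel folds are whatever is left of the 443 structures.
counts["novel folds"] = NEW_SEQUENCE_STRUCTURES - sum(counts.values())


def percent(part, whole):
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


# Share of each category among the 443 "new" sequences.
shares = {name: percent(n, NEW_SEQUENCE_STRUCTURES) for name, n in counts.items()}

# Clear plus probable homologues: the share recognisable with structural data.
homologue_share = percent(
    counts["clear homologues"] + counts["probable homologues"],
    NEW_SEQUENCE_STRUCTURES,
)

# Share of all 2159 domains that were already homologous by sequence.
sequence_homologue_share = percent(
    TOTAL_NEW_DOMAINS - NEW_SEQUENCE_STRUCTURES, TOTAL_NEW_DOMAINS
)

if __name__ == "__main__":
    print(shares)
    print(homologue_share, sequence_homologue_share)
```

The clear and probable homologues together make up 368 of the 443 structures, which is the 83% figure used later when structure-based function assignment is discussed.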

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
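Comparing "the first three EC identifiers" amounts to comparing a prefix of the four-level EC number. A minimal sketch follows; the function names are hypothetical and the two subtilisin-like EC numbers are used purely as an illustration, not as data from the analysis above.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def same_function_class(ec_numbers, levels=3):
    """True if all EC numbers in a family agree on their first `levels` fields.

    An empty family is treated as trivially consistent.
    """
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) <= 1


if __name__ == "__main__":
    family = ["3.4.21.62", "3.4.21.64"]  # illustrative serine endopeptidases
    print(same_function_class(family))            # agree to three levels
    print(same_function_class(family, levels=4))  # differ at the fourth level
```

With `levels=4` the same check corresponds to a family having "a single EC identifier" in the sense used in the paragraph above.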

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of "gene recruitment" is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such "gene recruitment" and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 24013010), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by "gene recruitment" discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be "novel" folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a "structural genomics" initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

[Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures are included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
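The distance criterion above can be sketched in plain Python. This is an illustration, not JAIL's actual code: chains are represented as a hypothetical mapping from residue number to a list of atom coordinates, and "at least 5 C-alpha atoms" is approximated as at least 5 residues on each side, since each amino acid residue carries one C-alpha atom.

```python
from math import dist

CUTOFF = 4.5       # Ångström, as stated in the definition above
MIN_RESIDUES = 5   # assumed stand-in for "at least 5 C-alpha atoms"


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residues of chain_a with at least one atom within `cutoff` of chain_b.

    Each chain maps a residue number to a list of (x, y, z) atom coordinates.
    """
    atoms_b = [xyz for atoms in chain_b.values() for xyz in atoms]
    return sorted(
        res_id
        for res_id, atoms in chain_a.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in atoms_b)
    )


def is_interface(chain_a, chain_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """True if both sides contribute at least `min_residues` residues."""
    side_a = interface_residues(chain_a, chain_b, cutoff)
    side_b = interface_residues(chain_b, chain_a, cutoff)
    return len(side_a) >= min_residues and len(side_b) >= min_residues


if __name__ == "__main__":
    # Toy coordinates: residue 1 lies 3 Å from chain B, residue 2 lies 7 Å away.
    chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
    chain_b = {1: [(3.0, 0.0, 0.0)]}
    print(interface_residues(chain_a, chain_b))
```

The all-against-all distance loop is quadratic in the number of atoms, which is fine for a toy example; a real pipeline would normally use a spatial index or a structure-handling library instead.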

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

JAIL can be searched by PDB ID (e.g. 1aay), SCOP ID (e.g. d1az0a_), EC number (e.g. 311), accession number (e.g. P03697) or protein name (e.g. capsid protein). The search can be restricted to the following interface types: domain-domain (SCOP), chain-chain, protein-nucleic acid and biounit-biounit. A full-text keyword search and a SCOP text search are also offered, and interfaces can be filtered by location (intra, inter or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera (e.g. model 0045110).

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

Primary Sequence DBs (collaborative project)

DDBJ (DNA Data Bank of Japan)
EMBL Nucleotide DB (European Molecular Biology Laboratory)
GenBank (National Center for Biotechnology Information)

Meta-DBs

Entrez Gene: Unified retrieval of gene-centred information (NCBI)
euGenes: Assembled information on eukaryotic genomes (Univ. of Indiana)
GeneCards (Weizmann Inst.)
GenLoc UDB (Weizmann Inst.)
SOURCE (Univ. of Stanford)
LocusLink (National Center for Biotechnology Information)

Genome Annotation Systems

Ensembl Genome Browser: Automatically Annotated Genomes (EMBL-EBI and Wellcome Trust Sanger Inst.)
UniGene: Automatic partitioning of GenBank sequences (NCBI)
Golden Path: UCSC (Univ. of California, Santa Cruz)

Specialized DBs

CGAP: Cancer Genes (National Cancer Institute)
Clone Registry: Clone Collections (National Center for Biotechnology Information)
IMAGE: Clone Collections (Image Consortium)
DBGET: H. sapiens retrieval system (Univ. of Kyoto)
DIP: Interacting Proteins (Univ. of California)
GDB (Human Genome Organization)
KEGG: Functional DB (Univ. of Kyoto)
MGI: Mouse Genome (Jackson Lab)
OMIM: Inherited Diseases (National Center for Biotechnology Information)
SWISS-PROT: Protein DB (Swiss Institute of Bioinformatics)
PEDANT: Protein DB (Forschungszentrum f. Umwelt & Gesundheit)
List with SNP-Databases
Reactome: The Genome Knowledgebase (EBI)

Microarray-DBs

ArrayExpress (European Bioinformatics Institute)
Gene Expression Omnibus (National Center for Biotechnology Information)
maxd (Univ. of Manchester)
SMD (Univ. of Stanford)

Accession codes vs. identifiers

Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways. Basically, it has two names:

Identifier

Accession code (or number)

The question of how to deal with changed, updated and deleted entries in databases is a very tricky problem, and the policies for how accession codes and identifiers are changed or kept constant are not completely consistent between databases, or even over time for one single database.

The exact definition of what the identifier and accession code are supposed to denote varies between the different databases, but the basic idea is the following.

Identifier

An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.

SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second part denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
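The two-part structure of a SWISS-PROT entry name can be taken apart with a single string operation. The sketch below assumes only what the text states, namely a protein part and a species part; the underscore separator is inferred from the KRAF_HUMAN example, and the function name is hypothetical.

```python
def split_entry_name(entry_name):
    """Split a SWISS-PROT style entry name into (protein, species) parts."""
    protein, separator, species = entry_name.partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return protein, species


if __name__ == "__main__":
    protein, species = split_entry_name("KRAF_HUMAN")
    print(protein, species)
```

Because identifiers may be changed by the curators, a split like this is useful for display or grouping, but it should not be used as a stable key.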

An identifier can usually change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this does not happen very often. In fact, it happens so rarely that it is not really a big problem.

Accession code (number)

An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.

The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.

In the case where two entries are merged into a single one, the new entry will have both accession codes, where one will be the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also have the old accession code as a secondary code.
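The merge and split rules can be modelled with a small record type. This is a sketch of the bookkeeping described above, not of any database's real implementation: the class, the function names and the accession strings in the example are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    primary: str                                    # current primary accession
    secondary: list = field(default_factory=list)   # retired accessions

    def accessions(self):
        """All accession codes that must keep pointing at this entry."""
        return [self.primary] + self.secondary


def merge(kept, absorbed):
    """Merge two entries: one primary survives, everything else is secondary."""
    return Entry(kept.primary, kept.secondary + absorbed.accessions())


def split(old, new_primary_a, new_primary_b):
    """Split an entry: both halves get new primaries and inherit old codes."""
    inherited = old.accessions()
    return (
        Entry(new_primary_a, list(inherited)),
        Entry(new_primary_b, list(inherited)),
    )


if __name__ == "__main__":
    merged = merge(Entry("A00001"), Entry("B00002"))   # made-up accessions
    print(merged.primary, merged.secondary)
    first, second = split(merged, "C00003", "D00004")
    print(first.accessions(), second.accessions())
```

Under this model a lookup by any old accession code still finds an entry after a merge, and finds both descendants after a split, which is the stability property the text asks of accession codes.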

NUCLEOTIDE DATABASES

NCBI's sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.

GenBank: An annotated collection of all publicly available nucleotide and amino acid sequences.

EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).

GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.

HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.

HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.

SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.

RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support our data-gathering efforts.

STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.

UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).

UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene, annotated with mapping and expression information and cross-references to other sources.

DNA & RNA Databases

Major Sequence Repositories – Human Chromosome Information – Organelle Genome Databases – RNA Databases – Comparative & Phylogenetic Databases – SNPs, Mutations and Variations Databases – Alternative Splicing Databases – Specialized Databases

Major Sequence Repositories

DDBJ: DNA Data Bank of Japan
EMBL: Maintained by EMBL
GenBank: Maintained by NCBI

Human Chromosome Information

Chromosome-specific information is provided for chromosomes 1 to 22, X and Y.

Organelle Genome Databases

OGMP: Organelle genome megasequencing program
GOBASE: An organelle genome database
MitoMap: Human mitochondrial genome database

RNA Databases

Rfam RNA familiy database

RNA base Database of RNA structures

tRNA database Database of tRNAs

tRNA tRNA sequences and genes

sRNA Small RNA database

Comparative amp Phylogenetic Databases

COG Phylogenetic classification of proteins

DHMHD Human-mouse homology database

HomoloGene Gene homologies across species

Homophila Human disease to Drosophila gene database

HOVERGEN Database of homologous vertebrate genes

TreeBase A database of phylogenetic knowledge

XREF Cross-referencing with model organisms

SNPs Mutations amp Variations Databases

ALPSbase Database of mutations causing human ALPS

dbSNP Single nucleotide polymorphism database at NCBI

HGVbase Human Genome Variation database

Alternative Splicing Databases

ASAP Alternate splicing analysis tool at UCLA

ASG Alternate splicing gallery

HASDB Human alternative splicing database at UCLA

AsMamDB Alternatively spliced genes in human, mouse and rat

ASD Alternative splicing database at CSHL

Specialised Databases

ABIM Links to several genomics databases

ACUTS Ancient conserved untranslated sequences

AGSD Animal genome size database

AmiGO The Gene Ontology database

ARGH The acronym database

ASDB Database of alternatively spliced genes

BACPAC BAC and PAC genomic DNA library info

BBID Biological Biochemical image database

Cardiac gene database

CHLC Genetic markers on chromosomes

COGENT Complete genome tracking database

COMPEL Composite regulatory elements in eukaryotes

CUTG Codon usage database

dbEST Database of expressed sequences or mRNA

dbGSS Genome survey sequence database

dbSTS Sequence tagged sites (STS)


DBTSS Database of transcriptional start sites

DOGS Database of genome sizes

EID The exon-intron database – Harvard

Exon-Intron Exon-Intron database – Singapore

EPD Eukaryotic promoter database

FlyTrap HTML based gene expression database

GDB The genome database

GenLink Resources for human genetic and telomere research

GeneKnockouts Gene knockout information

GENOTK Human cDNA database

GEO Gene expression omnibus NCBI

GOLD Information on genome projects around the world

GSDB The Genome Sequence DataBase

HGI TIGR human gene index

HTGS High-throughput genomic sequences at NCBI

IMAGE The largest collection of DNA sequences clones

IMGT The international ImMunoGeneTics information system

IPCN Index to Plant Chromosome Numbers database

LocusLink Single query interface to sequence and genetic loci

TelDB The telomere database

MitoDat Mitochondrial nuclear genes

Mouse EST NIA mouse cDNA project

MPSS Searchable databases of several species

NDB Nucleic acid database

NEDO Human cDNA sequence database

NPD Nuclear protein database

Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute

PLACE Database of plant cis-acting regulatory DNA elements

RDP Ribosomal database project

RDB Receptor database at NIHS Japan

Refseq The NCBI reference sequence project

RHdb Radiation hybrid physical map of chromosomes

SHIGAN SHared Information of GENetic resources Japan

SpliceDB Canonical and non-canonical splice site sequences

STACK Consensus human EST database

TAED The adaptive evolution database

TIGR Curated databases of microbes plants and humans

TRANSFAC The Transcription Factor Database

TRRD Transcription Regulatory region database

UniGene Cluster of sequences for unique genes at NCBI

UniSTS Non-redundant collection of STSs

Protein Databases


Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs

and Signatures – Others

Protein Sequence Databases

Antibodies Sequence and Structure

BRENDA Enzyme database

CD Antigens Database of CD antigens

dbCFC Cytokine family database

Histones Histone sequence database

HPRD Human protein reference database

InterPro Integrated documentation resources for protein families

iProClass An integrated protein classification database

KIND A non-redundant protein sequence database

MHCPEP Database of MHC binding peptides

MIPS Munich information centre for protein sequences

PIR Annotated and non-redundant protein sequence database

PIR-ALN Curated database of protein sequence alignments

PIR-NREF PIR non-redundant reference protein database

PMD Protein mutant database

PRF Protein research foundation Japan

ProClass Non-redundant protein database

ProtoMap Hierarchical classification of Swiss-Prot proteins

REBASE Restriction enzyme database

RefSeq Reference sequence database at NCBI

SwissProt Curated protein sequence database

SPTR Comprehensive protein sequence database

Transfac Transcription factor database

TrEMBL Annotated translations of EMBL nucleotide sequences

Tumor gene database Genes with cancer-causing mutations

WD repeats WD-repeat family of proteins

Protein Structure Databases

Cath Protein structure classification

HIV Protease HIV protease database 3D structure

PDB 3-D macromolecular structure data

PSI Protein structure initiative

S2F Structure to function project

Scop Structural Classification of Proteins

Protein Domains, Motifs & Signatures

BLOCKS Multiple aligned segments of conserved protein regions

CDD Conserved domain database and search service

DOMO Homologous protein domain families

Pfam Database of protein domains and HMMs


ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, bipolar disorder or schizophrenia

Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – Cyano Bacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus –

Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants –

Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy –

Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the www

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC


ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington University

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germ plasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)


Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyano Bacteria (Blue green algae)

Cyano Bacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)


Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics


Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey


Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often: what is found as an independently folding unit of a polypeptide chain frequently also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
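As a rough illustration of what scanning a query with a position-specific scoring matrix means, the sketch below slides a tiny hand-made PSSM along a protein sequence and reports the best-scoring ungapped window. The matrix values, the default score and the names `TOY_PSSM`, `score_window` and `best_hit` are all invented for illustration; RPS-BLAST itself performs a more elaborate search with gapped alignments and statistical evaluation.

```python
# Toy PSSM: one dict per model column, mapping residue -> log-odds score.
# All numbers are invented for illustration only.
TOY_PSSM = [
    {"G": 3, "A": 1},
    {"K": 4, "R": 2},
    {"S": 3, "T": 3},
]
DEFAULT_SCORE = -2  # assumed score for residues not listed in a column


def score_window(window, pssm):
    """Sum the per-column scores for one ungapped window."""
    return sum(col.get(res, DEFAULT_SCORE) for res, col in zip(window, pssm))


def best_hit(sequence, pssm):
    """Return (start, score) of the highest-scoring ungapped window."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    scored = [
        (score_window(sequence[i:i + width], pssm), i)
        for i in range(len(sequence) - width + 1)
    ]
    # max() keeps the first window in case of ties
    score, start = max(scored, key=lambda pair: pair[0])
    return start, score
```

Calling `best_hit("MAGKSLL", TOY_PSSM)` locates the window `GKS`, the only one in which every residue matches a favoured entry of its column.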


CDART Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14,831 families)

Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches

View a Pfam family: View Pfam family annotation and alignments

View a clan: See groups of related families

View a sequence: Look at the domain organisation of a protein sequence

View a structure: Find the domains on a PDB structure

Keyword search: Query Pfam by keywords

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can browse lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website.

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function.

Envelope coordinates

See Alignment coordinates.

Family

A collection of related protein regions.

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
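A minimal sketch of how the two cutoffs could be applied. It assumes that a sequence is kept only if its sequence score reaches the sequence cutoff, and that only the domain matches reaching the domain cutoff are then retained. The `Hit` record, its field names and the function name are hypothetical and are not Pfam's internal format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """Hypothetical record for one sequence matched against one Pfam HMM."""
    seq_id: str
    sequence_score: float        # bit score for the whole sequence
    domain_scores: List[float]   # bit score of each individual domain match


def apply_gathering_threshold(hits, seq_cutoff, dom_cutoff):
    """Keep sequences passing seq_cutoff, with only domains passing dom_cutoff."""
    kept = []
    for hit in hits:
        if hit.sequence_score < seq_cutoff:
            continue  # below the sequence cutoff: not in the full alignment
        domains = [s for s in hit.domain_scores if s >= dom_cutoff]
        if domains:
            kept.append(Hit(hit.seq_id, hit.sequence_score, domains))
    return kept
```

With both cutoffs set to 20.0, a sequence scoring 30.0 with domain scores 25.0 and 5.0 would be kept with only its first domain, while a sequence scoring 18.0 would be dropped entirely.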

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets.

Motif

A short unit found outside globular domains.

Noise cutoff (NC)

The bit score of the highest scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with * being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest scoring match in the full alignment.
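The trusted cutoff and the noise cutoff defined in this glossary can both be read off a list of match scores once the gathering threshold is known. The sketch below assumes that a match is in the full alignment exactly when its bit score is at or above the gathering threshold; the function name and the example numbers are invented for illustration.

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Return (TC, NC) for a list of match bit scores.

    TC is the lowest score that made it into the full alignment,
    NC is the highest score that did not. Either may be None when
    the corresponding group is empty.
    """
    in_full = [s for s in bit_scores if s >= gathering_threshold]
    not_in_full = [s for s in bit_scores if s < gathering_threshold]
    tc = min(in_full) if in_full else None
    nc = max(not_in_full) if not_in_full else None
    return tc, nc
```

By construction the gathering threshold lies between the two values: NC < GA <= TC.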

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember, though, that we are exploring a limited set of genomes.

Different color schemes are used to easily identify the mode we are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
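The definition above amounts to a one-line calculation. The sketch below takes the two likelihoods as plain numbers; the function and parameter names are made up for illustration, and real tools work in log space throughout rather than dividing raw probabilities.

```python
import math


def bit_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio (homology model vs random model)."""
    if likelihood_homologue <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homologue / likelihood_random)
```

A sequence that is eight times more likely under the homology model than under the random model scores about 3 bits; equal likelihoods give 0 bits, and a sequence better explained by the random model gets a negative score.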

BLAST Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
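
The difference between domain composition and domain organisation can be sketched in a few lines of Python. The function names and the example architectures are hypothetical, and the sketch assumes that "additional domains are allowed" means extra domains may appear anywhere in the target, including between query domains; SMART's actual matching rules may differ.

```python
from typing import List


def same_composition(query: List[str], target: List[str]) -> bool:
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)


def same_organisation(query: List[str], target: List[str]) -> bool:
    """True if the query domains occur in the target in the same order.

    Extra domains in the target are tolerated (subsequence test).
    """
    remaining = iter(target)
    # 'domain in remaining' advances the iterator, so order is enforced.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["SH2", "SH3", "TyrKc"]))         # order ignored
print(same_organisation(query, ["SH3", "C2", "SH2", "TyrKc"]))  # order kept
print(same_organisation(query, ["SH2", "SH3", "TyrKc"]))        # order broken
```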

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with the number of alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM (Hidden Markov Model)

HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
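
The rule described here can be sketched as follows. This is a minimal illustration, assuming that each model position is available as a dictionary of emission probabilities; the function name and data layout are hypothetical and do not reflect HMMer's internal representation.

```python
from typing import Dict, List


def hmm_consensus(emissions: List[Dict[str, float]], threshold: float = 0.5) -> str:
    """Build a one-line consensus from per-position emission probabilities.

    The most probable residue is reported at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    line = []
    for column in emissions:
        residue, prob = max(column.items(), key=lambda item: item[1])
        line.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(line)


# Hypothetical two-position model: position 1 is highly conserved (A),
# position 2 only weakly prefers L.
model = [{"A": 0.7, "G": 0.3}, {"L": 0.4, "V": 0.35, "I": 0.25}]
print(hmm_consensus(model))
```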

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB (non-redundant database)

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
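
To relate bits scores, E-values and P-values numerically, the sketch below uses the standard BLAST-style (Karlin-Altschul) relations E = m * n * 2^(-S') and P = 1 - exp(-E), where S' is the bits score and m and n are the query and database lengths. These formulas are an assumption about BLAST-type statistics in general; they are not taken from the SMART documentation, and SMART's HMM-based E-value calculation may differ. Function names are hypothetical.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value for a bits score S': E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: a 1024-residue query against a 1024-residue database.
e = expected_hits(20.0, 1024, 1024)
print(e, p_value(e))
```

For small E-values the two quantities are nearly identical, which is why they are often used interchangeably for strong hits.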

PDB (Protein Data Bank)

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database (domain sequence database)

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
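
The selection rule can be sketched as a simple counting step. This is an illustrative sketch only: it assumes the neighbour lists have already been retrieved and are supplied as a dictionary, and the function and parameter names are hypothetical rather than part of SMART.

```python
from collections import Counter
from typing import Dict, List


def secondary_literature(neighbours: Dict[str, List[str]],
                         min_sources: int = 3,
                         top_n: int = 100) -> List[str]:
    """Return neighbouring papers linked from more than two original papers.

    neighbours maps each hand-selected paper to its ranked neighbour list;
    only the first top_n neighbours of each paper are considered.
    """
    counts: Counter = Counter()
    for neighbour_ids in neighbours.values():
        # A set ensures each original paper votes once per neighbour.
        counts.update(set(neighbour_ids[:top_n]))
    return sorted(pid for pid, n in counts.items() if n >= min_sources)


# Hypothetical identifiers: paper "a" neighbours three originals, "b" only two.
example = {"p1": ["a", "b"], "p2": ["a", "b"], "p3": ["a", "c"]}
print(secondary_literature(example))
```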

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange;

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)

o each species' proteins are separated

o CD-HIT clustering with a 96% identity cutoff is performed on each species separately

o the longest member of each protein cluster is used as the representative

o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
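
The first and last steps of this procedure can be sketched in Python. The sketch assumes that clustering (the CD-HIT step) has already been run and that its clusters are passed in as lists of identifiers; the function names and toy sequences are hypothetical and the code is not part of SMART.

```python
from typing import Dict, List, Set


def collapse_identical(proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """Group identifiers by sequence so 100% identical proteins are kept once."""
    by_sequence: Dict[str, List[str]] = {}
    for protein_id, sequence in proteins.items():
        by_sequence.setdefault(sequence, []).append(protein_id)
    return by_sequence


def representatives(proteins: Dict[str, str], clusters: List[List[str]]) -> Set[str]:
    """Pick the longest member of each cluster, plus all unclustered proteins."""
    clustered = {pid for cluster in clusters for pid in cluster}
    longest = {max(cluster, key=lambda pid: len(proteins[pid])) for cluster in clusters}
    singles = set(proteins) - clustered
    return longest | singles


# Toy data for one species: "a" and "b" were clustered together upstream.
species_proteins = {"a": "MKT", "b": "MKTLL", "c": "MA", "d": "GGGG"}
print(sorted(representatives(species_proteins, [["a", "b"]])))
```

Keeping the identifier lists from the first step is what allows the different IDs of a collapsed sequence to remain available.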

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
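
The last-common-ancestor step can be illustrated with a short Python sketch. It assumes each organism is described by its lineage as a list of taxon names ordered from root to leaf; the function name and the example lineages are hypothetical, and real NCBI taxonomy lineages would normally be handled by taxon identifier.

```python
from typing import List, Optional


def last_common_ancestor(lineages: List[List[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    ancestor = None
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break  # lineages diverge here
        ancestor = taxa_at_depth[0]
    return ancestor


# Hypothetical organisms that all contain a protein with the same architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))
```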

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details visit the change mode page

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing, and the links to relevant literature.
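
A check of this kind can be sketched as below. This is only an illustration under stated assumptions: required residues are given as a mapping from 1-based positions within the predicted domain to the set of acceptable amino acids, and both the function name and the toy requirement table are hypothetical. How SMART actually encodes its catalytic requirements is not described here.

```python
from typing import Dict, List


def missing_catalytic_residues(domain_seq: str, required: Dict[int, str]) -> List[int]:
    """Return the 1-based positions whose required residue is absent."""
    missing = []
    for position, allowed in sorted(required.items()):
        if position > len(domain_seq) or domain_seq[position - 1] not in allowed:
            missing.append(position)
    return missing


# Hypothetical requirement: Asp at position 3 and Ser or Thr at position 5.
requirements = {3: "D", 5: "ST"}
for sequence in ("MKDHS", "MKAHS"):
    gaps = missing_catalytic_residues(sequence, requirements)
    print(sequence, "Inactive" if gaps else "Active", gaps)
```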

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices, or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
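
Adjusting a feature position according to gaps amounts to converting a position in the ungapped protein into a column of the alignment. A minimal sketch follows, assuming gaps are written as "-" and positions are 1-based; the function name and the toy aligned sequence are hypothetical.

```python
def to_alignment_column(aligned_seq: str, position: int, gap: str = "-") -> int:
    """Map a 1-based residue position onto its 1-based alignment column."""
    residues_seen = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            residues_seen += 1
            if residues_seen == position:
                return column
    raise ValueError("position lies beyond the end of the sequence")


# Toy aligned sequence with a two-column gap after the second residue:
# a domain starting at residue 3 is drawn from alignment column 5.
print(to_alignment_column("MK--TL", 3))
```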

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the "search schnipsel and structures" checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with "Pfam:" (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes

As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains, and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
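
The behaviour of such queries can be sketched in Python. The sketch assumes that repeating a name in an AND query asks for that many copies of the domain, that names are compared case-insensitively, and that the operator itself is written in upper case; the function name is hypothetical and SMART's real query parser is not shown here.

```python
import re
from collections import Counter
from typing import List


def matches_query(query: str, features: List[str]) -> bool:
    """True if the protein has enough copies of every name in an AND query."""
    terms = [t for t in re.split(r"\s+AND\s+", query.strip()) if t]
    required = Counter(term.upper() for term in terms)
    present = Counter(name.upper() for name in features)
    return all(present[name] >= count for name, count in required.items())


# A protein with three SH3 domains satisfies the three-copy query,
# and intrinsic features such as SIGNAL can be combined with domains.
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH2", "SH3"]))
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"]))
```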

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
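
The inclusion criteria and the nine-level code can be illustrated with a small Python sketch. The identity thresholds and cut-offs are those quoted in the text; everything else is an assumption made for illustration, including the function names, the representation of a CATHsolid code as nine dot-separated integers, and the example code itself.

```python
from typing import Dict

LEVELS = ["C", "A", "T", "H", "S", "O", "L", "I", "D"]
# Sequence identity (%) used for clustering at each sequence level (from the text).
IDENTITY_CUTOFF = {"S": 35, "O": 60, "L": 95, "I": 100}


def meets_cath_criteria(resolution: float, length: int, sidechain_fraction: float) -> bool:
    """Apply the stated criteria: 4 A or better, >= 40 residues, >= 70% side chains."""
    return resolution <= 4.0 and length >= 40 and sidechain_fraction >= 0.70


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a nine-level code into named levels (dot-separated form assumed)."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError("expected nine dot-separated levels")
    return dict(zip(LEVELS, (int(part) for part in parts)))


print(meets_cath_criteria(resolution=2.5, length=120, sidechain_fraction=0.9))
print(parse_cathsolid("3.40.50.300.1.1.1.1.1"))  # hypothetical identifier
```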

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
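The AutoChop decision can be pictured as a simple conjunction of thresholds. The sketch below covers only the four criteria listed explicitly; the text ends its list with "etc.", so the real protocol applies further checks that are not reproduced here. The class and function names are hypothetical, and the RMSD cutoff is read as 6.0 Å.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Quality measures for boundaries inherited from one chopped chain."""
    ssap_score: float             # SSAP structural similarity score (0-100)
    sequence_identity: float      # percent identity to the chopped chain
    rmsd: float                   # RMSD of the alignment, in angstroms
    longest_end_extension: int    # residues, longest unaligned end extension


def passes_listed_autochop_criteria(chop):
    """Check the four thresholds quoted in the text (the list is not exhaustive)."""
    return (chop.ssap_score >= 80
            and chop.sequence_identity > 80      # strictly greater than 80%
            and chop.rmsd <= 6.0
            and chop.longest_end_extension <= 10)


candidate = InheritedChop(ssap_score=88.5, sequence_identity=92.0,
                          rmsd=1.8, longest_end_extension=4)
if passes_listed_autochop_criteria(candidate):
    print("chop automatically (AutoChop)")
else:
    print("pass to manual curation as supporting evidence")
```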

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
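The two filters in this protocol, one for harvesting sequence relatives and a stricter one for transferring annotations, can be sketched as follows. The function names are hypothetical, the E-value cutoff is read as 0.01, and the overlaps are read as percentages and expressed as fractions of the domain covered by the hit; how overlap was actually computed is not stated in the source.

```python
def is_sequence_relative(e_value, overlap_fraction,
                         max_e_value=0.01, min_overlap=0.60):
    """HMM hit accepted as a relative: significant E-value and enough
    of the CATH domain covered by the match."""
    return e_value < max_e_value and overlap_fraction >= min_overlap


def can_transfer_annotation(identity_percent, overlap_fraction,
                            min_identity=95.0, min_overlap=0.80):
    """BLAST hit against a functional database that is close enough
    to be used for annotating the sequence."""
    return identity_percent >= min_identity and overlap_fraction >= min_overlap


# Toy hits as (E-value, overlap fraction); values are invented for illustration.
hits = [(1e-12, 0.92), (0.004, 0.55), (0.3, 0.99)]
relatives = [hit for hit in hits if is_sequence_relative(*hit)]
print(relatives)
```

The annotation filter is deliberately much stricter than the harvesting filter: a remote relative is useful for populating a superfamily, but only a near-identical sequence justifies copying a functional label.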

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
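As a rough illustration, the SSAP score ranges quoted above can be turned into a small lookup. The text says the scores "generally" fall in these ranges, so this is a rule of thumb rather than a classifier; the function name and the label strings are invented for this sketch.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the similarity range described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores run from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "typical of the same fold (Topology level)"
    return "below the fold-level range"


for value in (100, 86.4, 73.0, 52.1):
    print(value, "->", interpret_ssap(value))
```

A score in the fold-level range says nothing about ancestry on its own: as the paragraph above notes, without sequence or functional similarity it may reflect convergent evolution.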

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
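The two alternative routes into a homologous family can be written as a single rule. This is a sketch with one notable assumption: the paragraph above gives no number for "high structural similarity", so the default cutoff of SSAP ≥80 is borrowed from the clear-homologue criterion quoted elsewhere in these notes. The function and parameter names are hypothetical.

```python
def qualifies_for_homologous_family(sequence_identity, ssap_score,
                                    ssap_cutoff=80.0):
    """Apply the two alternative criteria for grouping into a homologous family.

    sequence_identity is a percentage; ssap_cutoff stands in for
    'high structural similarity' and is an assumption of this sketch.
    """
    if sequence_identity >= 35:
        return True  # significant sequence similarity on its own
    # otherwise require high structural similarity plus some sequence similarity
    return ssap_score >= ssap_cutoff and sequence_identity >= 20


print(qualifies_for_homologous_family(42.0, 75.0))
print(qualifies_for_homologous_family(24.0, 85.0))
print(qualifies_for_homologous_family(24.0, 72.0))
```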

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α−β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each 'fold band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
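The counts and percentages in this breakdown can be checked with a few lines of arithmetic. The number of novel folds is not stated in the text; the sketch derives it as the remainder of the 443 structures, which is an inference rather than a quoted figure.

```python
total = 443  # structures with 'new' sequences

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not quoted: whatever is left over must be the novel folds.
counts["novel folds"] = total - sum(counts.values())

# Percentages rounded to whole numbers, as in the text.
percentages = {name: round(100 * n / total) for name, n in counts.items()}

# Share recognisable as homologues (clear + probable) from structural data.
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / total
)

for name, n in counts.items():
    print(f"{name}: {n} ({percentages[name]}%)")
print(f"assignable as homologues: {homologue_share}%")
```

The derived remainder of 35 structures corresponds to the 8% of novel folds, and the combined homologue share reproduces the 83% figure used later in the notes.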

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
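Comparing EC identifiers level by level, as in this analysis, amounts to counting matching leading fields. The sketch below is illustrative only: the function names are invented, and treating a "-" placeholder in a partial EC number as "unknown, stop matching" is a design choice made here, not something stated in the source.

```python
def shared_ec_levels(ec_a, ec_b):
    """Count how many leading EC fields two identifiers have in common."""
    shared = 0
    for field_a, field_b in zip(ec_a.split("."), ec_b.split(".")):
        if field_a != field_b or field_a == "-":
            break  # a mismatch or an unassigned field ends the comparison
        shared += 1
    return shared


def same_first_three_levels(ec_a, ec_b):
    """True if two enzymes agree down to the third EC level."""
    return shared_ec_levels(ec_a, ec_b) >= 3


# Two serine endopeptidases (EC 3.4.21.x) differ only in the fourth field.
print(shared_ec_levels("3.4.21.62", "3.4.21.4"))
print(same_first_three_levels("3.4.21.62", "3.4.21.4"))
print(same_first_three_levels("1.1.1.1", "2.7.1.1"))
```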

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence and ultimately function for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, each part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
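The interface definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not JAIL's actual implementation: the data layout (a dictionary from residue id to a list of atom coordinates) and the function names are hypothetical, the cutoff is read as "less than or equal to 4.5 Ångström", and the C-alpha requirement is approximated by counting residues, since each amino acid residue contributes one C-alpha atom.

```python
import math

CUTOFF = 4.5  # Angstrom; distance threshold from the definition above
MIN_CA = 5    # minimum number of C-alpha atoms on each side (protein chains)


def interface_residues(side_a, side_b, cutoff=CUTOFF):
    """Return the residue ids of side_a that touch side_b.

    Each side maps a residue id to a list of (x, y, z) atom coordinates
    (hypothetical layout chosen for this sketch). A residue counts as
    interface residue if any of its atoms lies within `cutoff` of any
    atom of the other side.
    """
    partner_atoms = [atom for atoms in side_b.values() for atom in atoms]
    hits = set()
    for res_id, atoms in side_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            hits.add(res_id)
    return hits


def is_valid_interface(side_a, side_b, cutoff=CUTOFF, min_ca=MIN_CA):
    """Both sides must contribute at least `min_ca` residues.

    One residue carries one C-alpha atom, so counting residues is used
    here as a stand-in for counting C-alpha atoms.
    """
    return (len(interface_residues(side_a, side_b, cutoff)) >= min_ca
            and len(interface_residues(side_b, side_a, cutoff)) >= min_ca)
```

The brute-force all-against-all distance check is fine for a single pair of chains; a large-scale analysis would normally use a spatial index instead.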

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All chain interfaces that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf, a derived database that merges structural and sequence information.

Database scheme

JAIL can be searched by:

PDB ID, e.g. 1aay

SCOP ID, e.g. d1az0a_

EC number, e.g. 311

Accession number, e.g. P03697

Protein name, e.g. capsid protein

The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none. A full-text keyword search and a SCOP text search are also available, and results can be limited to interfaces with a given location (intra, inter, or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

SUPERFAMILY

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes. The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

DIP Interacting Proteins (Univ of California)

GDB (Human Genome Organization)

KEGG Functional Db (Univ of Kyoto)

MGI Mouse Genome (Jackson Lab)

OMIM Inherited Diseases (National Center for Biotechnology Information)

SWISS-PROT Protein Db (Swiss Institute of Bioinformatics)

PEDANT Protein Db (Forschungszentrum für Umwelt & Gesundheit)

List with SNP-Databases

Reactome The Genome Knowledgebase (EBI)

Microarray-DBs

ArrayExpress (European Bioinformatics Institute)

Gene Expression Omnibus (National Center for Biotechnology Information)

maxd (Univ of Manchester)

SMD (Univ of Stanford)

Accession codes vs. identifiers

Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways. Basically, it has two names:

Identifier

Accession code (or number)

The question of how to deal with changed, updated and deleted entries in databases is a very tricky problem, and the policies for how accession codes and identifiers are changed or kept constant are not completely consistent between databases, or even over time for one single database.

The exact definition of what the identifier and accession code are supposed to denote varies between the different databases, but the basic idea is the following.

Identifier

An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.

SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
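The two-part entry name described above can be taken apart with a short helper. This is a minimal sketch: the function name is hypothetical, and it assumes only the PROTEIN_SPECIES convention stated in the text, not any further validation rules the database may apply.

```python
def split_entry_name(entry_name):
    """Split a SWISS-PROT-style entry name into (protein, species) parts.

    Assumes the 'PROTEIN_SPECIES' convention described in the text;
    anything without both parts is rejected.
    """
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not a PROTEIN_SPECIES entry name: {entry_name!r}")
    return protein, species


print(split_entry_name("KRAF_HUMAN"))  # prints ('KRAF', 'HUMAN')
```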

An identifier can usually change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this does not happen very often; in fact, it happens so rarely that it is not really a big problem.

Accession code (number)

An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.

The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.

When two entries are merged into a single one, the new entry will have both accession codes, where one will be the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also have the old accession code as a secondary code.
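The merge and split rules for accession codes can be modelled with a small data structure. The sketch below is illustrative only: the `Entry` class, the function names and the accession codes used in the usage note (X00001 and so on) are hypothetical, and real databases apply additional policies that are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    identifier: str   # human-readable name; may change over time
    primary: str      # primary accession code; stable once issued
    secondary: list = field(default_factory=list)  # older codes still pointing here

    def accessions(self):
        """All accession codes that resolve to this entry."""
        return [self.primary, *self.secondary]


def merge(kept, absorbed, identifier=None):
    """Merge two entries: the kept primary code stays primary,
    every other code becomes a secondary accession."""
    secondary = kept.secondary + absorbed.accessions()
    return Entry(identifier or kept.identifier, kept.primary, secondary)


def split(old, new_parts):
    """Split one entry. `new_parts` is a list of (identifier, new_primary)
    pairs; each new entry keeps all old codes as secondary accessions."""
    return [Entry(ident, acc, old.accessions()) for ident, acc in new_parts]


def lookup(entries, accession):
    """Return every entry that a given accession code resolves to."""
    return [e for e in entries if accession in e.accessions()]
```

For example, merging hypothetical entries `Entry("PROT_A", "X00001")` and `Entry("PROT_B", "X00002")` yields one entry with primary code X00001 and secondary code X00002, so a lookup by either code still succeeds. After a split, a lookup by an old code returns both new entries, which is why citing the accession code rather than the identifier keeps published references resolvable.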

NUCLEOTIDE DATABASES

NCBI's sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.

GenBank: An annotated collection of all publicly available nucleotide and amino acid sequences.

EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).

GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.

HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.

HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.

SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.

RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support NCBI's data-gathering efforts.

STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.

UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).

UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene, annotated with mapping and expression information and cross-references to other sources.

DNA & RNA Databases

Major Sequence Repositories – Human Chromosome Information – Organelle Genome Databases – RNA Databases – Comparative & Phylogenetic Databases – SNPs, Mutations and Variations Databases – Alternative Splicing Databases – Specialized Databases

Major Sequence Repositories

DDBJ DNA Data Bank of Japan

EMBL Maintained by EMBL

GenBank Maintained by NCBI

Human Chromosome Information

Chromosome information is available for each human chromosome: 1–22, X and Y.

Organelle Genome Databases

OGMP Organelle genome megasequencing program

GOBASE An organelle genome database

MitoMap Human mitochondrial genome database

RNA Databases

Rfam RNA family database

RNA base Database of RNA structures

tRNA database Database of tRNAs

tRNA tRNA sequences and genes

sRNA Small RNA database

Comparative & Phylogenetic Databases

COG Phylogenetic classification of proteins

DHMHD Human-mouse homology database

HomoloGene Gene homologies across species

Homophila Human disease to Drosophila gene database

HOVERGEN Database of homologous vertebrate genes

TreeBase A database of phylogenetic knowledge

XREF Cross-referencing with model organisms

SNPs, Mutations & Variations Databases

ALPSbase Database of mutations causing human ALPS

dbSNP Single nucleotide polymorphism database at NCBI

HGVbase Human Genome Variation database

Alternative Splicing Databases

ASAP Alternate splicing analysis tool at UCLA

ASG Alternate splicing gallery

HASDB Human alternative splicing database at UCLA

AsMamDB alternatively spliced genes in human mouse and rat

ASD Alternative splicing database at CSHL

Specialised Databases

ABIM Links to several genomics databases

ACUTS Ancient conserved untranslated sequences

AGSD Animal genome size database

AmiGO The Gene Ontology database

ARGH The acronym database

ASDB Database of alternatively spliced genes

BACPAC BAC and PAC genomic DNA library info

BBID Biological Biochemical image database

Cardiac gene database

CHLC Genetic markers on chromosomes

COGENT Complete genome tracking database

COMPEL Composite regulatory elements in eukaryotes

CUTG Codon usage database

dbEST Database of expressed sequences or mRNA

dbGSS Genome survey sequence database

dbSTS Sequence tagged sites (STS)

DBTSS Database of transcriptional start sites

DOGS Database of genome sizes

EID The exon-intron database ndash Harvard

Exon-Intron Exon-Intron database ndash Singapore

EPD Eukaryotic promoter database

FlyTrap HTML based gene expression database

GDB The genome database

GenLink Resources for human genetic and telomere research

GeneKnockouts Gene knockout information

GENOTK Human cDNA database

GEO Gene expression omnibus NCBI

GOLD Information on genome projects around the world

GSDB The Genome Sequence DataBase

HGI TIGR human gene index

HTGS High-throughput genomic sequence at NCBI

IMAGE The largest collection of DNA sequences clones

IMGT The international ImMunoGeneTics information system

IPCN Index to Plant Chromosome Numbers database

LocusLink Single query interface to sequence and genetic loci

TelDB The telomere database

MitoDat Mitochondrial nuclear genes

Mouse EST NIA mouse cDNA project

MPSS Searchable databases of several species

NDB Nucleic acid database

NEDO Human cDNA sequence database

NPD Nuclear protein database

Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute

PLACE Database of plant cis-acting regulatory DNA elements

RDP Ribosomal database project

RDB Receptor database at NIHS Japan

Refseq The NCBI reference sequence project

RHdb Radiation hybrid physical map of chromosomes

SHIGAN SHared Information of GENetic resources Japan

SpliceDB Canonical and non-canonical splice site sequences

STACK Consensus human EST database

TAED The adaptive evolution database

TIGR Curated databases of microbes plants and humans

TRANSFAC The Transcription Factor Database

TRRD Transcription Regulatory region database

UniGene Cluster of sequences for unique genes at NCBI

UniSTS Non-redundant collection of STS

Protein Databases

Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others

Protein Sequence Databases

Antibodies Sequence and Structure

BRENDA Enzyme database

CD Antigens Database of CD antigens

dbCFC Cytokine family database

Histones Histone sequence database

HPRD Human protein reference database

InterPro Integrated documentation resources for protein families

iProClass An integrated protein classification database

KIND A non-redundant protein sequence database

MHCPEP Database of MHC binding peptides

MIPS Munich information centre for protein sequences

PIR Annotated and non-redundant protein sequence database

PIR-ALN Curated database of protein sequence alignments

PIR-NREF PIR non-redundant reference protein database

PMD Protein mutant database

PRF Protein research foundation Japan

ProClass Non-redundant protein database

ProtoMap Hierarchical classification of SwissProt proteins

REBASE Restriction enzyme database

RefSeq Reference sequence database at NCBI

SwissProt Curated protein sequence database

SPTR Comprehensive protein sequence database

Transfac Transcription factor database

TrEMBL Annotated translations of EMBL nucleotide sequences

Tumor gene database Genes with cancer-causing mutations

WD repeats WD-repeat family of proteins

Protein Structure Databases

Cath Protein structure classification

HIV Protease HIV protease database 3D structure

PDB 3-D macromolecular structure data

PSI Protein structure initiative

S2F Structure to function project

Scop Structural Classification of Proteins

Protein Domains, Motifs & Signatures

BLOCKS Multiple aligned segments of conserved protein regions

CDD Conserved domain database and search service

DOMO Homologous protein domain families

Pfam Database of protein domains and HMMs

ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia

Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – Cyano Bacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the www

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC

ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germplasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated Mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)

Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyano Bacteria (Blue green algae)

Cyano Bacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)

Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp.)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics

Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey

Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The Genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain often also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.

The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application anda script interface for a conserved domain searchon multiple protein sequences accepting up to 100000proteins in a single job It enables you to view a graphicaldisplay of the concise or full search result for any individual

protein from your input list or todownload the results forthe complete set of proteins The Batch CD-SearchHelp provides additional details

CD-Search (Help amp FTP) Batch CD-Search (Help) Publications

CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu, and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at the United States National Center for Biotechnology Information, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries.

CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches

View a Pfam family: View Pfam family annotation and alignments

View a clan: See groups of related families

View a sequence: Look at the domain organisation of a protein sequence

View a structure: Find the domains on a PDB structure

Keyword search: Query Pfam by keywords

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

The Pfam website lets you browse lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
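To make the idea of a "position-specific scoring system" concrete, here is a minimal sketch that turns a toy alignment into per-column log-odds scores. It is a deliberate simplification and not Pfam's or HMMER's procedure: a real profile HMM also models insertions and deletions, and it uses sequence weighting and mixture priors. The alignment, the uniform background frequency, and the pseudocount of 1 are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy seed alignment (hypothetical data): four aligned sequences, three columns.
ALIGNMENT = ["ACD", "ACE", "ACD", "GCD"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumed uniform background frequency


def build_position_scores(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    n_seqs = len(alignment)
    total = n_seqs + pseudocount * len(ALPHABET)
    scores = []
    for column in zip(*alignment):
        counts = Counter(column)
        col_scores = {}
        for aa in ALPHABET:
            freq = (counts.get(aa, 0) + pseudocount) / total
            col_scores[aa] = math.log2(freq / BACKGROUND)
        scores.append(col_scores)
    return scores


def score_sequence(scores, seq):
    """Sum the per-position scores for an ungapped sequence of matching length."""
    return sum(col[aa] for col, aa in zip(scores, seq))


position_scores = build_position_scores(ALIGNMENT)
print(round(score_sequence(position_scores, "ACD"), 2))  # resembles the family
print(round(score_sequence(position_scores, "WWW"), 2))  # does not
```

A sequence that matches the conserved columns scores higher than an unrelated one, which is the property that database searches with such models rely on.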


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with "*" being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
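The three cutoffs in this glossary (GA, TC and NC) are related in a simple way, which the sketch below illustrates. It assumes a single gathering threshold applied to one list of bit scores; Pfam actually keeps separate sequence and domain cutoffs, so this is only a simplified illustration. The function name and the scores are hypothetical.

```python
def classify_hits(bit_scores, gathering_threshold):
    """Split hits at the gathering threshold (GA) and derive TC and NC.

    TC is the lowest score that made it into the full alignment;
    NC is the highest score that did not.
    """
    in_full = [s for s in bit_scores if s >= gathering_threshold]
    excluded = [s for s in bit_scores if s < gathering_threshold]
    trusted_cutoff = min(in_full) if in_full else None
    noise_cutoff = max(excluded) if excluded else None
    return in_full, trusted_cutoff, noise_cutoff


# Hypothetical bit scores for matches to one Pfam entry.
hits = [85.2, 40.1, 27.5, 24.9, 12.3]
members, tc, nc = classify_hits(hits, gathering_threshold=25.0)
print(members, tc, nc)
```

By construction NC < GA <= TC, so the gap between NC and TC shows how cleanly the threshold separates members from non-members.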

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember that you are exploring a limited set of genomes, though.

Different color schemes are used to easily identify the mode you are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
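As a small worked example of the definition above, the following sketch computes a bits score as the base-2 logarithm of a likelihood ratio. The function name and the two likelihood values are made up for illustration; real tools work with log-probabilities throughout to avoid numerical underflow.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 log of the ratio between the two model likelihoods."""
    return math.log2(likelihood_homologue / likelihood_random)


# Hypothetical likelihoods: the homology model explains the sequence
# eight times better than the random model, i.e. 3 bits of evidence.
print(bits_score(0.008, 0.001))
```

Each additional bit doubles the ratio in favour of homology, and a negative score means the random model explains the sequence better.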

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coils predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
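The difference between domain composition and domain organisation can be sketched in a few lines. This is an illustrative reading of the two definitions, not SMART's implementation: "same order, additional domains allowed" is interpreted here as an ordered-subsequence test, and the function names and example architectures are assumptions.

```python
def same_composition(query, target):
    """True if the target has at least one copy of each query domain."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Extra domains in the target are allowed (ordered-subsequence test).
    """
    remaining = iter(target)
    # "domain in remaining" consumes the iterator up to the first match,
    # so each later query domain must be found after the previous one.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_organisation(query, ["SH3", "SH2", "PH", "TyrKc"]))
print(same_composition(query, ["TyrKc", "SH2", "SH3"]))
print(same_organisation(query, ["TyrKc", "SH2", "SH3"]))
```

A protein with the query's domains in reversed order therefore matches by composition but not by organisation.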

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
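E-values and P-values describe the same chance expectation in two ways. The sketch below uses the standard BLAST-style relations, E = m * n * 2^(-S) for a bit score S and P = 1 - exp(-E); it is offered as a general illustration and is not necessarily how SMART computes its own statistics. The function names, and the use of query length times database length as the search space, are assumptions.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E-value: chance hits expected at or above this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value_from_e(e_value):
    """P-value: probability of at least one chance hit, 1 - exp(-E)."""
    return -math.expm1(-e_value)


e = expected_hits(bit_score=40.0, query_length=300, database_length=5_000_000)
print(e, p_value_from_e(e))
```

For small values the two numbers are nearly identical, which is why they are often used interchangeably for significant hits; they diverge once E approaches 1.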

PDB: protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Natl. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998, we extended this set by prokaryotic signalling and extracellular domains. In the input page, you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included into the secondary literature list.
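The selection rule in this procedure amounts to counting, for each neighbouring paper, how many original papers it was retrieved for. The sketch below assumes the neighbour lists have already been fetched (the Medline retrieval step is not shown) and reads "more than two" as at least three; the function name and the paper identifiers are hypothetical.

```python
from collections import defaultdict


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbours retrieved for at least `min_sources` original papers."""
    sources = defaultdict(set)
    for original, neighbours in neighbours_by_paper.items():
        for neighbour in neighbours:
            sources[neighbour].add(original)
    return sorted(n for n, found_in in sources.items()
                  if len(found_in) >= min_sources)


# Hypothetical neighbour lists (real ones would hold 100 entries each).
neighbours = {
    "paperA": ["n1", "n2", "n3"],
    "paperB": ["n1", "n2"],
    "paperC": ["n1", "n4"],
    "paperD": ["n5"],
}
print(secondary_literature(neighbours))
```

Using a set per neighbour means a paper listed twice for the same original paper is still counted once.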

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence, SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
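The redundancy-reduction procedure above can be sketched as follows. The clustering itself is not reimplemented: the sketch assumes cluster assignments (such as CD-HIT would produce) are supplied as a precomputed mapping. It also assumes identical sequences are collapsed before the per-species split, following the order of the list. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def reduce_redundancy(proteins, cluster_of):
    """Pick one representative per (species, cluster).

    proteins:   list of (protein_id, species, sequence) tuples
    cluster_of: protein_id -> cluster id from an external clustering step;
                proteins without an entry are treated as singletons
    """
    # Step 1: keep one copy of identical sequences, remembering all IDs.
    aliases = {}
    first_seen = {}
    for pid, species, seq in proteins:
        aliases.setdefault(seq, []).append(pid)
        first_seen.setdefault(seq, (pid, species))

    # Steps 2-3: group the remaining proteins by species and cluster.
    groups = defaultdict(list)
    for seq, (pid, species) in first_seen.items():
        groups[(species, cluster_of.get(pid, pid))].append((pid, seq))

    # Step 4: the longest member represents each cluster.
    representatives = {
        key: max(members, key=lambda m: len(m[1]))[0]
        for key, members in groups.items()
    }
    return representatives, aliases


proteins = [
    ("P1", "human", "MKTAYIAK"),
    ("P2", "human", "MKTAYIAK"),    # identical to P1
    ("P3", "human", "MKTAYIAKQR"),  # clustered with P1, but longer
    ("P4", "yeast", "MSSNN"),       # not in any cluster
]
representatives, aliases = reduce_redundancy(proteins, {"P1": "c1", "P3": "c1"})
print(sorted(representatives.values()), aliases["MKTAYIAK"])
```

Keeping the alias list is what allows the discarded IDs to remain searchable, as the first step of the procedure requires.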

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
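Finding the last common ancestor reduces to finding the deepest taxon shared by all lineages. The sketch below assumes each organism's lineage is available as a root-to-species list of taxon names; the lineages shown are shortened toy examples rather than full NCBI taxonomy paths, and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-species lineages."""
    ancestor = None
    for taxa in zip(*lineages):
        if all(taxon == taxa[0] for taxon in taxa):
            ancestor = taxa[0]
        else:
            break
    return ancestor


# Toy lineages for organisms carrying the same domain architecture.
human = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Chordata"]
fly = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Arthropoda"]
yeast = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi"]

print(last_common_ancestor([human, fly]))
print(last_common_ancestor([human, fly, yeast]))
```

Adding a more distantly related organism can only move the predicted point of origin closer to the root, never further from it.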

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as "Inactive". The domain annotation page will show you the details on which amino acids are missing, and the links to relevant literature (PubMed link).
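A catalytic activity check of this kind can be sketched as a lookup of required residues at given positions. This is only an illustration of the idea: the positions here are plain 1-based offsets into the predicted domain sequence, whereas a real check would have to map positions through the domain alignment. The function name, sequences and required residues are all hypothetical.

```python
def missing_catalytic_residues(domain_sequence, required_residues):
    """List (position, found, allowed) for each unmet requirement.

    required_residues maps a 1-based position in the domain to the
    residues accepted at that position.
    """
    missing = []
    for position, allowed in sorted(required_residues.items()):
        if position <= len(domain_sequence):
            found = domain_sequence[position - 1]
        else:
            found = "-"  # domain is too short to cover this position
        if found not in allowed:
            missing.append((position, found, allowed))
    return missing


# Hypothetical catalytic triad requirement.
required = {2: "H", 3: "D", 4: "S"}
for seq in ("GHDSA", "GHASA"):
    problems = missing_catalytic_residues(seq, required)
    print(seq, "Inactive" if problems else "Active", problems)
```

Returning the full list of unmet requirements, rather than a single flag, mirrors the annotation page behaviour of reporting which amino acids are missing.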

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern, standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 35 to 40

Intron positions are shown in protein schematics For proteins that match any of the Ensembl predictions SMART will show intron positions asvertical coloured lines in graphical representations (see example) This information is retrieved froma pre-calculated mapping of Ensembl gene structures to protein sequences

The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
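The idea of a 1:1 reciprocal best match between two genomes can be sketched as follows. The input format (a dictionary of hit scores per protein, higher being better) and the identifiers are assumptions for illustration; the text does not say which similarity search or score SMART uses.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    a_to_b and b_to_a map each protein to {hit_id: score}; higher score is better.
    """
    def best(hits):
        return max(hits, key=hits.get) if hits else None

    pairs = []
    for a, hits in a_to_b.items():
        b = best(hits)
        if b is not None and best(b_to_a.get(b, {})) == a:
            pairs.append((a, b))
    return sorted(pairs)

# Hypothetical search results between genome A and genome B.
a_to_b = {"A1": {"B1": 200, "B2": 90}, "A2": {"B1": 150}}
b_to_a = {"B1": {"A1": 210, "A2": 140}, "B2": {"A1": 80}}
rbh = reciprocal_best_hits(a_to_b, b_to_a)
print(rbh)
```

Here A2 also hits B1, but B1 prefers A1, so only the pair (A1, B1) qualifies as a 1:1 match.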

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
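Adjusting feature positions for gaps amounts to translating residue numbers into alignment columns. The sketch below shows one way to do this, assuming 1-based coordinates and "-" as the gap character; the function names and the aligned sequence are illustrative only.

```python
def residue_to_column(aligned_seq, gap="-"):
    """Map 1-based residue positions to 1-based alignment columns."""
    mapping = {}
    residue = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            residue += 1
            mapping[residue] = column
    return mapping

def map_feature(aligned_seq, start, end):
    """Translate a feature (start, end) in sequence coordinates to alignment columns."""
    columns = residue_to_column(aligned_seq)
    return columns[start], columns[end]

# Hypothetical aligned sequence containing two gap regions.
aligned = "MK--LVTA-G"
print(map_feature(aligned, 2, 4))
print(map_feature(aligned, 3, 7))
```

A domain spanning residues 2 to 4 therefore covers columns 2 to 6 of the alignment, because two gap columns fall inside it.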

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002, 42: 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
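The behaviour described for selective SMART (AND-ed terms, case insensitivity, repeated terms meaning multiple copies) can be mimicked with a short sketch. This is an illustration of the query semantics only, under the assumption that a protein is represented as a flat list of domain and feature names; it is not SMART's query engine.

```python
from collections import Counter

def matches(query, architecture):
    """True if the protein has every AND-ed term; repeated terms require that many copies."""
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in architecture)
    return all(present[term] >= count for term, count in wanted.items())

# Hypothetical proteins given as lists of domain/feature names.
kinase = ["SIGNAL", "TyrKc", "TRANS"]
adaptor = ["SH3", "SH2", "SH3"]

print(matches("SIGNAL AND TyrKc", kinase))
print(matches("Sh3 AND sH3 AND sh3", adaptor))
```

The second query asks for three SH3 copies, so a protein with only two does not match, whatever the capitalisation of the term.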

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to come from structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains.

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
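To make the nine-level code concrete, the sketch below splits a CATHSOLID-style identifier into its levels and reports the deepest level two domains share. The dot-separated numeric format and the example codes are assumptions made for illustration; the identity cut-offs for S, O, L and I are the ones quoted in the text.

```python
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")
# Sequence identity cut-offs (%) quoted for the S, O, L and I levels.
SEQ_ID_CUTOFF = {"S": 35, "O": 60, "L": 95, "I": 100}

def parse_cathsolid(code):
    """Split a dot-separated nine-level code into a {level: number} dictionary."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, map(int, parts)))

def shared_level(code_a, code_b):
    """Return the deepest level at which two domains are in the same group, or None."""
    a, b = parse_cathsolid(code_a), parse_cathsolid(code_b)
    deepest = None
    for level in LEVELS:
        if a[level] != b[level]:
            break
        deepest = level
    return deepest

# Hypothetical identifiers for two domains from the same superfamily.
example = parse_cathsolid("3.40.50.720.1.2.1.1.3")
level = shared_level("3.40.50.720.1.2.1.1.3", "3.40.50.720.1.3.1.1.1")
print(level, SEQ_ID_CUTOFF.get(level))
```

In this toy case the two domains agree down to the S-level, i.e. they fall in the same 35% sequence identity cluster but in different 60% clusters.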

Table 1 CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
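The automatic-chopping decision can be pictured as a simple threshold test. The sketch below covers only the four criteria named in the text (the original list ends with "etc.", so further criteria exist that are not modelled here); the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain (hypothetical)."""
    ssap_score: float
    seq_identity: float          # percent
    rmsd: float                  # angstroms
    longest_end_extension: int   # residues

def can_autochop(result):
    """Apply the four thresholds quoted in the text; other criteria are not modelled."""
    return (result.ssap_score >= 80
            and result.seq_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)

close_relative = Inheritance(ssap_score=85.2, seq_identity=92.0, rmsd=1.8,
                             longest_end_extension=4)
borderline = Inheritance(ssap_score=85.2, seq_identity=80.0, rmsd=1.8,
                         longest_end_extension=4)
print(can_autochop(close_relative), can_autochop(borderline))
```

A result that fails any threshold would not be discarded but passed on as supporting information for manual boundary assignment.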

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvement in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000 folds). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
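The hit-selection rule for sequence relatives can be written as a small filter. The sketch assumes the thresholds read as an E-value below 0.01 and at least 60% of the CATH domain covered by the match, and it assumes that overlap is measured as aligned residues divided by domain length; the text does not define overlap precisely, and the function name is hypothetical.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """Keep a hit if it is significant and covers enough of the CATH domain.

    Overlap is taken here as aligned_residues / domain_length (an assumption).
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap

# Hypothetical hits against a 120-residue CATH domain.
print(is_homologous_hit(1e-5, 90, 120))   # significant, 75% overlap
print(is_homologous_hit(0.05, 110, 120))  # good overlap, weak E-value
print(is_homologous_hit(1e-5, 60, 120))   # significant, only 50% overlap
```

The stricter annotation rule in the same paragraph (95% identity, 80% overlap) could be expressed the same way with different thresholds.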

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4 Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed, or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include for example CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
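As a rough reading aid, the SSAP score ranges quoted above can be turned into a lookup. The text says these ranges hold "generally", so the sketch below is a rule of thumb only; the function name and labels are hypothetical, and a score alone does not establish homology without sequence or functional evidence.

```python
def interpret_ssap(score):
    """Rule-of-thumb reading of an SSAP score (0-100) using the ranges in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (possibly convergent)"
    return "no clear structural relationship"

for value in (100, 85.3, 74.0, 60.0):
    print(value, interpret_ssap(value))
```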

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
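Because the four levels are nested, a CATH identifier can be treated as four numbers, one per level. The sketch below assumes the usual dot-separated notation (the subtilisin id quoted later in these notes is read here as 3.40.50.200); the class and function names are illustrative, and the second id in the example is only a comparison value, not one taken from the text.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    class_: int             # C: secondary-structure composition
    architecture: int       # A: 3D arrangement of secondary structures
    topology: int           # T: fold group
    homologous_family: int  # H: evolutionary family


def parse_cath_id(text: str) -> CathId:
    """Parse a dot-separated CATH id such as '3.40.50.200'."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-level CATH id: {text!r}")
    return CathId(*(int(p) for p in parts))


def deepest_shared_level(a: CathId, b: CathId) -> str:
    """Name the deepest level of the hierarchy at which two ids agree."""
    names = ("class", "architecture", "topology", "homologous family")
    deepest = "none"
    for name, x, y in zip(names, a, b):
        if x != y:
            break
        deepest = name
    return deepest


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.720")  # comparison id, for illustration only
print(deepest_shared_level(subtilisin, other))
```

Two domains that agree down to the topology level share a fold but are not necessarily homologous; only agreement at the fourth level places them in the same family.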

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
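The grouping rule can be expressed as a two-branch test. In this sketch the sequence thresholds (35% and 20% identity) come from the paragraph above, while the cutoff for "high structural similarity" is an assumption: it borrows the SSAP score of 80 that these notes quote elsewhere for clear homologues. The function name is hypothetical.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           high_ssap: float = 80.0) -> bool:
    """Apply the two alternative criteria for placing domains in one family.

    high_ssap is an assumed cutoff for "high structural similarity".
    """
    # Criterion 1: significant sequence similarity on its own.
    if seq_identity_pct >= 35:
        return True
    # Criterion 2: high structural similarity plus some sequence similarity.
    return ssap_score >= high_ssap and seq_identity_pct >= 20


print(same_homologous_family(42, 65))  # sequence alone is enough
print(same_homologous_family(24, 88))  # structure plus weak sequence signal
print(same_homologous_family(12, 88))  # structure alone is not enough
```

The third case is where the later discussion of probable homologues and analogues applies: additional functional or profile-based evidence is needed.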

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
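The percentages in this breakdown can be checked directly from the counts. The short calculation below uses only the numbers quoted in the text; the count of novel folds is not stated there and is inferred here by subtraction, which is an assumption.

```python
total_new_domains = 2159   # domains classified June 1997 to March 1998
new_sequences = 443        # domains without a clear sequence match

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Not given explicitly in the text: inferred as the remainder.
counts["novel folds"] = new_sequences - sum(counts.values())

percentages = {name: round(100 * n / new_sequences) for name, n in counts.items()}

# Share of all new domains that were clearly homologous by sequence.
homologous_by_sequence_pct = round(
    100 * (total_new_domains - new_sequences) / total_new_domains
)

# Share of the 'new' sequences that structure could assign to a known family.
assigned_by_structure_pct = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / new_sequences
)

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
print("homologous by sequence:", homologous_by_sequence_pct, "%")
print("assigned via structure:", assigned_by_structure_pct, "%")
```

The last figure is the combined share of clear and probable homologues, which is where the statement that most structures with novel sequences can still be assigned as homologues comes from.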


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
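The comparison of "the first three EC identifiers" amounts to comparing a prefix of the four-part EC number. A minimal sketch follows; the function names are hypothetical and the EC numbers in the example are illustrative values, not data from the analysis.

```python
from typing import Iterable, Tuple


def ec_prefix(ec_number: str, levels: int = 3) -> Tuple[str, ...]:
    """Return the first `levels` fields of a four-part EC number."""
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields: {ec_number!r}")
    return tuple(parts[:levels])


def family_shares_ec(ec_numbers: Iterable[str], levels: int = 3) -> bool:
    """True if every member of a family agrees on the first `levels` fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1


# Illustrative family: members differ only in the fourth field.
family = ["3.4.21.62", "3.4.21.64", "3.4.21.66"]
print(family_shares_ec(family, levels=3))  # related functions
print(family_shares_ec(family, levels=4))  # not a single EC code
```

Setting `levels=4` corresponds to the stricter "single EC code" criterion used for the 95% and 98% figures.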

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different ligands in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
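To see what these ranges mean for a whole genome, the two percentages can be combined. This is a back-of-the-envelope sketch: the function name is hypothetical, and pairing the ends of the two ranges (54% with 90%, 70% with 80%) is an assumption made only to bracket the outcome.

```python
def genome_breakdown(unmatched_pct: float, recovered_pct: float) -> dict:
    """Split a genome (in % of sequences) into three groups.

    unmatched_pct: sequences with no obvious sequence match in the PDB.
    recovered_pct: share of those found homologous once structures are known.
    """
    structure_only = unmatched_pct * recovered_pct / 100
    return {
        "assigned by sequence": 100 - unmatched_pct,
        "assigned only via structure": structure_only,
        "novel folds": unmatched_pct - structure_only,
    }


optimistic = genome_breakdown(unmatched_pct=54, recovered_pct=90)
pessimistic = genome_breakdown(unmatched_pct=70, recovered_pct=80)

for label, result in (("optimistic", optimistic), ("pessimistic", pessimistic)):
    print(label, {k: round(v, 1) for k, v in result.items()})
```

Under these assumptions, novel folds would make up roughly 5% to 14% of all sequences in a genome, which is smaller than the 10–20% quoted above because that figure is relative to the unmatched sequences only.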

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


[Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.

To overcome this problem, we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form allows comprehensive, non-redundant data sets to be compiled for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
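The distance criterion can be sketched in a few lines. This is not JAIL's own code: the atom representation, function names and toy coordinates are assumptions made for illustration, the cutoff is read as 4.5 Å, and the brute-force all-against-all search would need a spatial index for real structures.

```python
import math
from typing import List, Set, Tuple

# (residue id, atom name, x, y, z); a deliberately minimal atom record.
Atom = Tuple[str, str, float, float, float]

CUTOFF = 4.5       # Angstrom, as read from the definition above
MIN_CA_ATOMS = 5   # minimum C-alpha atoms for one side of a protein interface


def interface_residues(chain: List[Atom], partner: List[Atom],
                       cutoff: float = CUTOFF) -> Set[str]:
    """Residues of `chain` with at least one atom within `cutoff` of `partner`."""
    found: Set[str] = set()
    for res_id, _, x, y, z in chain:
        if res_id in found:
            continue  # the whole residue is already part of the interface
        for _, _, px, py, pz in partner:
            if math.dist((x, y, z), (px, py, pz)) <= cutoff:
                found.add(res_id)
                break
    return found


def is_valid_side(chain: List[Atom], partner: List[Atom]) -> bool:
    """Check the minimum-size rule for one side of a protein interface."""
    residues = interface_residues(chain, partner)
    ca_count = sum(1 for res_id, name, *_ in chain
                   if name == "CA" and res_id in residues)
    return ca_count >= MIN_CA_ATOMS


# Toy data: six C-alpha atoms in a row, five partner atoms 4.0 A above them.
chain_a = [(f"A{i + 1}", "CA", 3.8 * i, 0.0, 0.0) for i in range(6)]
chain_b = [(f"B{i + 1}", "CA", 3.8 * i, 4.0, 0.0) for i in range(5)]

print(sorted(interface_residues(chain_a, chain_b)))
print(is_valid_side(chain_a, chain_b))
```

In the toy data the sixth residue of chain A has no partner atom within the cutoff, so only the first five residues form the interface, which is just enough to satisfy the size rule.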

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

The JAIL search form accepts the following query types:

PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Searches can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit. A full-text keyword search and a SCOP text search are also available, and results can be limited by interface location (intra, inter, or don't care).

MMDB

MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Identifier

An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.

SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
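The two-part structure of a SWISS-PROT entry name can be illustrated with a short helper. This sketch assumes the two parts are joined by a single underscore, as in the KRAF_HUMAN example; the function name is hypothetical.

```python
from typing import Tuple


def split_entry_name(entry_name: str) -> Tuple[str, str]:
    """Split an entry name such as 'KRAF_HUMAN' into (protein, species) codes."""
    protein, separator, species = entry_name.partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"expected PROTEIN_SPECIES, got {entry_name!r}")
    return protein, species


protein_code, species_code = split_entry_name("KRAF_HUMAN")
print(protein_code, species_code)
```

Because the name encodes human-readable meaning, it is convenient for reading but, as discussed next, less suitable than an accession code for permanent references.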

An identifier can change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this does not happen very often. In fact, it happens so rarely that it is not really a big problem.

Accession code (number)

An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.

The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.

When two entries are merged into a single one, the new entry will have both accession codes, where one will be the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes but will also have the old accession code as a secondary code.
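The merge and split rules can be modelled with a small data structure. This is a sketch of the bookkeeping only, not of how any particular database implements it; the class, the function names and all accession codes and identifiers in the example are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    identifier: str                 # human-readable name, may change
    primary: str                    # primary accession code
    secondary: List[str] = field(default_factory=list)

    def accessions(self) -> List[str]:
        return [self.primary, *self.secondary]


def merge(a: Entry, b: Entry, identifier: str) -> Entry:
    """Merged entry keeps a's primary code; every other code becomes secondary."""
    return Entry(identifier, a.primary, [*a.secondary, *b.accessions()])


def split(entry: Entry, identifiers: List[str], new_codes: List[str]) -> List[Entry]:
    """Each new entry gets a fresh primary code and inherits the old codes."""
    return [Entry(name, code, entry.accessions())
            for name, code in zip(identifiers, new_codes)]


def resolve(entries: List[Entry], accession: str) -> List[Entry]:
    """Find every current entry that an accession code points to."""
    return [e for e in entries if accession in e.accessions()]


# Hypothetical codes, for illustration only.
merged = merge(Entry("ABC1_TEST", "X00001"), Entry("ABC2_TEST", "X00002"), "ABC_TEST")
parts = split(merged, ["ABCA_TEST", "ABCB_TEST"], ["X00003", "X00004"])

print(merged.accessions())
print([p.identifier for p in resolve(parts, "X00002")])
```

After the split, an old code resolves to both descendants, which is why the text says an accession code refers to an entry "or its ancestors" rather than to exactly one record forever.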

NUCLEOTIDE DATABASES

NCBI's sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.

GenBank: An annotated collection of all publicly available nucleotide and amino acid sequences.

EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).

GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.


HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.

HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.

SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.

RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support our data-gathering efforts.

STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.

UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).

UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene, annotated with mapping and expression information and cross-references to other sources.

DNA & RNA Databases

Major Sequence Repositories – Human Chromosome Information – Organelle Genome Databases – RNA Databases – Comparative & Phylogenetic Databases – SNPs, Mutations and Variations Databases – Alternative Splicing Databases – Specialized Databases

Major Sequence Repositories

DDBJ: DNA Data Bank of Japan
EMBL: Maintained by EMBL
GenBank: Maintained by NCBI

Human Chromosome Information

Chromosome-specific information is available for each human chromosome: 1–22, X and Y.

Organelle Genome Databases

OGMP: Organelle Genome Megasequencing Program
GOBASE: An organelle genome database
MitoMap: Human mitochondrial genome database

RNA Databases

Rfam RNA family database

RNA base Database of RNA structures

tRNA database Database of tRNAs

tRNA tRNA sequences and genes

sRNA Small RNA database

Comparative & Phylogenetic Databases

COG Phylogenetic classification of proteins

DHMHD Human-mouse homology database

HomoloGene Gene homologies across species

Homophila Human disease to Drosophila gene database

HOVERGEN Database of homologous vertebrate genes

TreeBase A database of phylogenetic knowledge

XREF Cross-referencing with model organisms

SNPs, Mutations & Variations Databases

ALPSbase Database of mutations causing human ALPS

dbSNP Single nucleotide polymorphism database at NCBI

HGVbase Human Genome Variation database

Alternative Splicing Databases

ASAP Alternate splicing analysis tool at UCLA

ASG Alternate splicing gallery

HASDB Human alternative splicing database at UCLA

AsMamDB Alternatively spliced genes in human, mouse and rat

ASD Alternative splicing database at CSHL

Specialised Databases

ABIM Links to several genomics databases

ACUTS Ancient conserved untranslated sequences

AGSD Animal genome size database

AmiGO The Gene Ontology database

ARGH The acronym database

ASDB Database of alternatively spliced genes

BACPAC BAC and PAC genomic DNA library info

BBID Biological Biochemical image database

Cardiac gene database

CHLC Genetic markers on chromosomes

COGENT Complete genome tracking database

COMPEL Composite regulatory elements in eukaryotes

CUTG Codon usage database

dbEST Database of expressed sequences or mRNA

dbGSS Genome survey sequence database

dbSTS Sequence tagged sites (STS)

DBTSS Database of transcriptional start sites

DOGS Database of genome sizes

EID The exon-intron database – Harvard

Exon-Intron Exon-Intron database – Singapore

EPD Eukaryotic promoter database

FlyTrap HTML-based gene expression database

GDB The genome database

GenLink Resources for human genetic and telomere research

GeneKnockouts Gene knockout information

GENOTK Human cDNA database

GEO Gene expression omnibus NCBI

GOLD Information on genome projects around the world

GSDB The Genome Sequence DataBase

HGI TIGR human gene index

HTGS High-throughput genomic sequences at NCBI

IMAGE The largest collection of DNA sequence clones

IMGT The international ImMunoGeneTics information system

IPCN Index to Plant Chromosome Numbers database

LocusLink Single query interface to sequence and genetic loci

TelDB The telomere database

MitoDat Mitochondrial nuclear genes

Mouse EST NIA mouse cDNA project

MPSS Searchable databases of several species

NDB Nucleic acid database

NEDO Human cDNA sequence database

NPD Nuclear protein database

Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute

PLACE Database of plant cis-acting regulatory DNA elements

RDP Ribosomal database project

RDB Receptor database at NIHS Japan

Refseq The NCBI reference sequence project

RHdb Radiation hybrid physical map of chromosomes

SHIGAN SHared Information of GENetic resources Japan

SpliceDB Canonical and non-canonical splice site sequences

STACK Consensus human EST database

TAED The adaptive evolution database

TIGR Curated databases of microbes plants and humans

TRANSFAC The Transcription Factor Database

TRRD Transcription Regulatory Region Database

UniGene Cluster of sequences for unique genes at NCBI

UniSTS Non-redundant collection of STSs

Protein Databases

Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others

Protein Sequence Databases

Antibodies Sequence and Structure

BRENDA Enzyme database

CD Antigens Database of CD antigens

dbCFC Cytokine family database

Histones Histone sequence database

HPRD Human protein reference database

InterPro Integrated documentation resources for protein families

iProClass An integrated protein classification database

KIND A non-redundant protein sequence database

MHCPEP Database of MHC binding peptides

MIPS Munich information centre for protein sequences

PIR Annotated and non-redundant protein sequence database

PIR-ALN Curated database of protein sequence alignments

PIR-NREF PIR non-redundant reference protein database

PMD Protein mutant database

PRF Protein research foundation Japan

ProClass Non-redundant protein database

ProtoMap Hierarchical classification of Swiss-Prot proteins

REBASE Restriction enzyme database

RefSeq Reference sequence database at NCBI

SwissProt Curated protein sequence database

SPTR Comprehensive protein sequence database

Transfac Transcription factor database

TrEMBL Annotated translations of EMBL nucleotide sequences

Tumor gene database Genes with cancer-causing mutations

WD repeats WD-repeat family of proteins

Protein Structure Databases

Cath Protein structure classification

HIV Protease HIV protease database 3D structure

PDB 3-D macromolecular structure data

PSI Protein structure initiative

S2F Structure to function project

SCOP Structural Classification of Proteins

Protein Domains, Motifs & Signatures

BLOCKS Multiple aligned segments of conserved protein regions

CDD Conserved domain database and search service

DOMO Homologous protein domain families

Pfam Database of protein domains and HMMs

ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, bipolar disorder or schizophrenia

Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – Cyanobacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the www

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC

ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germ plasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)

Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyanobacteria (Blue-green algae)

Cyanobacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)

Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics

Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey

Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often: what is found as an independently folding unit of a polypeptide chain frequently also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.

The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

CDD

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
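The idea of scanning a query with a pre-calculated PSSM can be illustrated with a minimal sketch. This is not RPS-BLAST itself, which uses gapped alignment, search heuristics and E-value statistics; it only slides an ungapped window along the query and reports the best-scoring position. The matrix `TOY_PSSM`, the function name `scan_with_pssm` and the default penalty are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical 3-position PSSM: one dict of residue -> score per position.
# Residues not listed at a position receive the default penalty.
TOY_PSSM: List[Dict[str, float]] = [
    {"G": 2.0, "A": 0.5},
    {"K": 1.5, "R": 1.0},
    {"S": 2.0, "T": 1.5},
]


def scan_with_pssm(query: str, pssm: List[Dict[str, float]],
                   default: float = -1.0) -> Tuple[int, float]:
    """Return (start, score) of the best-scoring ungapped window in query."""
    width = len(pssm)
    if len(query) < width:
        raise ValueError("query is shorter than the PSSM")
    best_start, best_score = 0, float("-inf")
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Sum the position-specific score of each residue in the window.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


start, score = scan_with_pssm("MLGKSAV", TOY_PSSM)
print(start, score)  # best window is "GKS" at offset 2, score 5.5
```

Because each column carries its own scores, the same residue can be rewarded at one position and penalised at another, which is what distinguishes a PSSM from a single substitution matrix.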

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.

CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its "Links" menu, and select "Domain Relatives" to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
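Grouping proteins by domain architecture can be sketched in a few lines. The example below treats an architecture as the ordered tuple of domain names on a protein, groups proteins that share the same tuple, and ranks the groups by how many distinct domains they share with the query. The ranking rule, the function name `group_by_architecture` and the protein identifiers are assumptions made for illustration; the notes do not describe how CDART actually computes its scores.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Architecture = Tuple[str, ...]


def group_by_architecture(
    proteins: Dict[str, List[str]], query: List[str]
) -> List[Tuple[Architecture, int, List[str]]]:
    """Group proteins by ordered domain list and rank groups against query.

    Returns (architecture, shared_domain_count, protein_names) tuples,
    best-matching groups first. Groups sharing no domain are dropped.
    """
    groups: Dict[Architecture, List[str]] = defaultdict(list)
    for name, domains in proteins.items():
        groups[tuple(domains)].append(name)

    query_domains = set(query)
    ranked = []
    for arch, names in groups.items():
        shared = len(query_domains & set(arch))
        if shared:
            ranked.append((arch, shared, sorted(names)))
    # Most shared domains first; ties broken by architecture for stability.
    ranked.sort(key=lambda group: (-group[1], group[0]))
    return ranked


# Hypothetical proteins annotated with illustrative domain names.
proteins = {
    "P1": ["SH3", "SH2", "TyrKc"],
    "P2": ["SH3", "SH2", "TyrKc"],
    "P3": ["SH2", "TyrKc"],
    "P4": ["PH"],
}
for arch, shared, names in group_by_architecture(proteins, ["SH3", "SH2", "TyrKc"]):
    print(arch, shared, names)
```

Two proteins can land in the same group even when their sequences have diverged considerably, as long as the same domains are detected in the same order.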

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at NCBI, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD).

Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI-curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14,831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.

Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches

View a Pfam family: View Pfam family annotation and alignments

View a clan: See groups of related families

View a sequence: Look at the domain organisation of a protein sequence

View a structure: Find the domains on a PDB structure

Keyword search: Query Pfam by keywords

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam: Lists of families, clans or proteomes can be browsed by initial letter (or number). Pfam also lists the families which are new to this release and the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
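How the two GA cutoffs might be applied can be sketched as follows. The assumption here, which the glossary entry does not spell out, is that a match is kept only when the overall sequence score reaches the sequence cutoff and the individual domain score reaches the domain cutoff. The function name `domains_passing_ga` and all scores are hypothetical.

```python
from typing import List


def domains_passing_ga(sequence_score: float, domain_scores: List[float],
                       seq_cutoff: float, dom_cutoff: float) -> List[float]:
    """Return the domain scores that clear both gathering cutoffs.

    Assumption: the sequence as a whole must reach seq_cutoff before any
    of its domains is considered; each domain must then reach dom_cutoff.
    """
    if sequence_score < seq_cutoff:
        return []
    return [score for score in domain_scores if score >= dom_cutoff]


# Hypothetical bit scores for one sequence matched against one Pfam HMM.
kept = domains_passing_ga(sequence_score=30.0, domain_scores=[22.0, 8.0],
                          seq_cutoff=25.0, dom_cutoff=20.0)
print(kept)  # only the 22.0-bit domain clears both cutoffs
```

Keeping the two cutoffs separate lets a curator accept a sequence with several weak repeats on the strength of its total score, or reject spurious short matches on an otherwise unrelated protein.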

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
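The step from a multiple sequence alignment to a position-specific scoring system can be illustrated with a deliberately simplified sketch. It computes per-column log2-odds scores with a flat pseudocount, which corresponds roughly to the match-state emission scores of a profile. A real profile HMM, as built by HMMER, also models insertions and deletions and uses sequence weighting and more elaborate priors, none of which is shown. The function name, the reduced four-letter alphabet and the uniform background are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def column_log_odds(alignment: List[str], background: Dict[str, float],
                    pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Convert alignment columns into per-position log2-odds scores."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    alphabet = sorted(background)
    profile = []
    for column in zip(*alignment):
        # Count only residues from the alphabet; gaps and others are skipped.
        counts = Counter(res for res in column if res in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background[res])
            for res in alphabet
        })
    return profile


# Toy alignment over a hypothetical four-letter alphabet.
background = {"A": 0.25, "G": 0.25, "K": 0.25, "S": 0.25}
profile = column_log_odds(["GKS", "GKS", "AKS", "GKA"], background)
print(profile[0]["G"])  # G is enriched in column 0, so its score is positive
```

A fully conserved column (here the K in the second position) receives a higher score for its residue than a partly conserved one, which is how conservation in the seed alignment ends up steering the search.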

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with "*" being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
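As a rough illustration of the scale described above, the sketch below decodes a per-residue posterior probability string into numeric levels and assigns a two-colour label. Treating "." as a position without a residue, the two-colour simplification of the heat map, and the cut-off of 7 are assumptions for this example and are not taken from Pfam or HMMER documentation.

```python
from typing import List, Optional


def pp_levels(pp_line: str) -> List[Optional[int]]:
    """Decode a posterior-probability string into levels 0..10.

    '*' is read as 10 and digits as their value. '.' is assumed to mark
    a position with no aligned residue and is returned as None.
    """
    levels: List[Optional[int]] = []
    for ch in pp_line:
        if ch == "*":
            levels.append(10)
        elif ch.isdigit():
            levels.append(int(ch))
        elif ch == ".":
            levels.append(None)
        else:
            raise ValueError(f"unexpected posterior probability symbol: {ch!r}")
    return levels


def heat_colour(level: int, threshold: int = 7) -> str:
    """Two-colour simplification of a heat map (threshold is arbitrary)."""
    return "green" if level >= threshold else "red"


levels = pp_levels("9*5.")
print(levels)  # [9, 10, 5, None]
print([heat_colour(lv) for lv in levels if lv is not None])
```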

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that in Genomic mode you are exploring a limited set of genomes.

Different color schemes are used on the site to make it easy to identify which mode (Normal or Genomic) is active.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
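The definition above translates directly into a one-line calculation. The sketch below takes the two likelihoods as inputs; the function name `bits_score` and the example values are hypothetical, and real tools derive these likelihoods from their own alignment models.

```python
import math


def bits_score(p_homologue_model: float, p_random_model: float) -> float:
    """Return log2 of the likelihood ratio between the two models."""
    if p_homologue_model <= 0 or p_random_model <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(p_homologue_model / p_random_model)


# A sequence 1000 times more likely under the homologue model than under
# the random model scores just under 10 bits.
print(round(bits_score(1e-6, 1e-9), 2))
```

Read the other way round, a score of S bits means the sequence is 2^S times more likely under the homologue model than under the random model, so every additional bit doubles the odds.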

BLAST (Basic local alignment search tool)

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
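
The difference between the two definitions can be illustrated with a short sketch. It assumes a protein is represented simply as a list of domain names in N- to C-terminal order; the function names and example architectures are made up for illustration and are not SMART's implementation.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of each domain of the query."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query's domains occur in the protein in the same order.

    Additional domains may be interleaved, so this is a subsequence test.
    """
    remaining = iter(protein)
    # 'domain in remaining' advances the iterator past the first match,
    # so later query domains must be found further along the protein.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["TyrKc", "SH2", "SH3"]))         # True
print(same_organisation(query, ["TyrKc", "SH2", "SH3"]))        # False
print(same_organisation(query, ["SH3", "SH2", "PH", "TyrKc"]))  # True
```

Every protein with the same organisation also has the same composition, but not the other way round.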

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM (Hidden Markov model)

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
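
As a rough illustration of the rule above, the sketch below builds a consensus line from per-position residue probabilities. The input format (one dictionary per match position) and the toy numbers are assumptions made for the example; HMMer reads these values from its own model files.

```python
def hmm_consensus(position_probs, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    The most probable residue is shown at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    line = []
    for probs in position_probs:
        residue, p = max(probs.items(), key=lambda item: item[1])
        line.append(residue.upper() if p > threshold else residue.lower())
    return "".join(line)


# Toy emission probabilities for three match positions (assumed values).
toy_model = [
    {"G": 0.90, "A": 0.10},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.60, "R": 0.40},
]
print(hmm_consensus(toy_model))  # GlK
```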

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB (non-redundant database)

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
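
The glossary does not give a formula linking the two quantities. A commonly used relation in BLAST-style statistics, assumed here, treats the number of chance hits as Poisson distributed, so that the probability of at least one chance hit is P = 1 - exp(-E). The function name below is illustrative only.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming a Poisson model.

    Uses P = 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    return -math.expm1(-e_value)


for e in (1e-6, 0.1, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two numbers are nearly identical, which is why E-values below about 0.01 are often read as if they were probabilities.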

PDB (Protein Data Bank)

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few have also been found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database (domain sequence database)

'Schnipsel' is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
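
The selection rule described above can be sketched as follows. The sketch assumes the neighbour lists have already been retrieved (the Medline lookup itself is not shown), and the function name, input layout and paper identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_papers=3):
    """Select neighbouring papers shared by more than two original papers.

    neighbours_by_paper maps each hand-selected paper to the list of its
    neighbouring papers (up to 100 each in the procedure described above).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour once per original paper.
        counts.update(set(neighbours))
    return sorted(paper for paper, n in counts.items() if n >= min_papers)


# Toy identifiers (assumed): only "x" and "y" are neighbours of three originals.
toy = {
    "paper_A": ["x", "y", "z"],
    "paper_B": ["x", "y"],
    "paper_C": ["x", "w"],
    "paper_D": ["y", "y"],
}
print(secondary_literature(toy))  # ['x', 'y']
```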

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)

o each species' proteins are separated

o CD-HIT clustering with 96% identity cutoff is performed on each species separately

o longest member of each protein cluster is used as the representative

o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
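
Two of the steps in this procedure are simple enough to sketch: collapsing identical sequences while keeping every ID, and picking the longest member of each cluster as its representative. The clustering itself is done by the external CD-HIT program and is not reproduced here; the cluster lists below stand in for its output, and all names and sequences are made-up examples.

```python
from collections import defaultdict


def collapse_identical(records):
    """Keep one copy of each identical sequence, remembering every ID.

    records is an iterable of (protein_id, sequence) pairs.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in records:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters):
    """Return the longest sequence of each cluster as its representative.

    clusters is a list of lists of sequences, assumed to come from an
    external clustering step run on one species at a time.
    """
    return [max(cluster, key=len) for cluster in clusters]


records = [("P1", "MKTAYIAK"), ("P2", "MKTAYIAK"), ("P3", "MKTAYIAKQR")]
print(collapse_identical(records))
print(pick_representatives([["MKTAYIAK", "MKTAYIAKQR"], ["MSLN"]]))
```

Proteins that end up in no cluster can be passed as single-member clusters, so they are kept unchanged.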

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
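
The last-common-ancestor step can be sketched as a longest-common-prefix computation. This assumes each organism is given as a lineage listed from the root of the taxonomy downwards; the lineages below are heavily simplified and the function name is illustrative, not SMART's own code.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    Each lineage is a list of taxon names ordered from the root downwards.
    """
    if not lineages:
        raise ValueError("at least one lineage is required")
    ancestor = None
    for taxa in zip(*lineages):
        if all(taxon == taxa[0] for taxon in taxa):
            ancestor = taxa[0]
        else:
            break
    return ancestor


# Simplified example lineages (assumed) for organisms sharing an architecture.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"]

print(last_common_ancestor([human, fly]))         # Metazoa
print(last_common_ancestor([human, fly, yeast]))  # Eukaryota
```

Adding a single protein from a distant organism can move the predicted point of origin much closer to the root.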

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience; Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard 'Domain selection' querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for Mozilla web browser. Click here for more info.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-)) and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ('tax break') is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains, and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
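
The behaviour described here (repeated names ask for multiple copies, and case is ignored) can be mimicked with a small sketch. It only handles queries joined by AND and represents a protein as a list of domain names; the function name and this representation are assumptions, not the actual SMART query engine.

```python
import re
from collections import Counter


def matches_query(query, protein_domains):
    """Check an AND-only domain query against a protein's domain list.

    Domain names are compared case-insensitively, and a name repeated in
    the query requires that many copies in the protein.
    """
    terms = [t.strip().lower() for t in re.split(r"\s+AND\s+", query.strip())]
    required = Counter(terms)
    available = Counter(name.lower() for name in protein_domains)
    return all(available[name] >= count for name, count in required.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH2", "SH3"]))  # True
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3"]))         # False
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"]))     # True
```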

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram, representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.


GENERAL INTRODUCTION

The numbers of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains.

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels: S, O, L, I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
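
To make the nine-level code concrete, the sketch below splits a CATHSOLID identifier into its named levels. It assumes the levels are written as dot-separated integers in the order C, A, T, H, S, O, L, I, D; the example identifier is invented for illustration and does not refer to a real domain.

```python
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code):
    """Split a dot-separated nine-level code into a level-to-number mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(part) for part in parts)))


def superfamily(code):
    """Return the first four levels (C.A.T.H), i.e. the homologous superfamily."""
    return ".".join(code.split(".")[:4])


example = "3.40.50.300.1.2.1.1.3"  # hypothetical identifier
print(parse_cathsolid(example))
print(superfamily(example))  # 3.40.50.300
```

Two domains that share the first seven numbers would fall in the same 95% sequence identity cluster under this reading of the levels.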


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction methodmdashCATHEDRAL (4) This is used to search a newly determined multi-domain structureagainst a library of representative structures from different fold groups in the CATHdatabase to recognise constituent domains CATHEDRAL performs an initial rapid

secondary structure comparison between structures using graph theory to identify putative fold matches which are then more carefully aligned using a slower moreaccurate dynamic programming method A new scoring scheme has beenimplemented which combines information on the class of the domains beingcompared their sizes similarity in the structural environments and the number of equivalent residues A support vector machine is used to combine the different scoresand select the best fold match for each putative domain in a new multi-domainprotein

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
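The boundary criterion used in the benchmark (predicted boundaries within 15 residues of the manually validated ones) can be illustrated with a small check. This is a sketch under simplifying assumptions: each domain is treated as a single continuous segment given as a (start, end) pair, and domains are compared in order. The actual benchmark protocol is not described in the text and may differ.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if every predicted (start, end) lies within `tolerance`
    residues of the corresponding reference boundary.

    Assumes one continuous segment per domain and the same domain order
    in both lists (a simplification made for this sketch).
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )


manual = [(1, 110), (111, 250)]
print(boundaries_agree([(1, 100), (101, 250)], manual))  # differences of 10
print(boundaries_agree([(1, 130), (131, 250)], manual))  # differences of 20
```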

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
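The decision between automatic chopping and manual curation described above can be sketched as follows. The four thresholds are those listed in the text (SSAP score of at least 80, sequence identity above 80%, RMSD of at most 6.0 Å, longest end extension of at most 10 residues); the text says "etc.", so the real protocol applies further criteria that are not modelled here. The class and function names are hypothetical, and ranking passing candidates by SSAP score is an assumption made only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InheritedChopping:
    """Hypothetical summary of boundaries inherited from one template chain."""
    template_chain: str
    ssap_score: float
    sequence_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues


def passes_autochop(c: InheritedChopping) -> bool:
    """Check only the four criteria named in the text."""
    return (
        c.ssap_score >= 80
        and c.sequence_identity_pct > 80
        and c.rmsd_angstrom <= 6.0
        and c.longest_end_extension <= 10
    )


def best_automatic_chopping(
    candidates: List[InheritedChopping],
) -> Optional[InheritedChopping]:
    """Return a passing candidate, or None to signal manual assignment.

    Picking the highest SSAP score among passing candidates is an
    assumption of this sketch, not a documented rule.
    """
    passing = [c for c in candidates if passes_autochop(c)]
    return max(passing, key=lambda c: c.ssap_score, default=None)


candidates = [
    InheritedChopping("tmplA", 92.0, 95.0, 1.2, 3),
    InheritedChopping("tmplB", 78.0, 85.0, 2.5, 4),
]
chosen = best_automatic_chopping(candidates)
print(chosen.template_chain if chosen else "manual assignment needed")
```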

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
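The hit-selection rule for harvesting sequence relatives (E-value below 0.01 and 60% residue overlap with the CATH domain) can be written as a short filter. This is a sketch only: the function name and arguments are hypothetical, and overlap is computed here as aligned residues divided by the domain length, which is an assumption because the text does not define how overlap is measured.

```python
def is_superfamily_hit(e_value, aligned_residues, domain_length,
                       max_e_value=0.01, min_overlap=0.60):
    """Return True if an HMM hit passes the thresholds quoted in the text.

    Overlap is taken as aligned_residues / domain_length (an assumption).
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


print(is_superfamily_hit(1e-5, 90, 120))   # 75% overlap, significant E-value
print(is_superfamily_hit(1e-5, 60, 120))   # only 50% overlap
print(is_superfamily_hit(0.05, 110, 120))  # E-value too high
```

The stricter annotation rule in the same paragraph (95% identity with 80% overlap) could be expressed the same way by adding an identity argument and changing the defaults.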

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily is measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
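As a rough illustration of how the SSAP score ranges above are read, the following sketch maps a score to the interpretation given in the text. The function name and the returned labels are hypothetical. The text says scores "generally" fall in these ranges, so this is a rule of thumb rather than a classifier, and real assignments also draw on sequence and functional evidence.

```python
def interpret_ssap(score):
    """Map an SSAP score (0 to 100) to the rule-of-thumb reading in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level)"
    return "no clear structural relationship"


for s in (100, 85, 75, 60):
    print(s, interpret_ssap(s))
```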

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each 'fold band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
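The percentages quoted for the 443 structures with new sequences can be checked with a few lines of arithmetic. The counts 199, 169 and 40 are taken from the text; the number of novel folds is not stated directly and is derived here as the remainder, which is an inference. The function name is hypothetical.

```python
def breakdown(total, counts):
    """Return {category: (count, rounded percent)}, adding the remainder
    as 'novel folds' (derived, not stated in the text)."""
    remainder = total - sum(counts.values())
    full = {**counts, "novel folds": remainder}
    return {name: (n, round(100 * n / total)) for name, n in full.items()}


result = breakdown(
    443,
    {"clear homologues": 199, "probable homologues": 169, "analogues": 40},
)
for name, (n, pct) in result.items():
    print(f"{name}: {n} ({pct}%)")

# Share of new-sequence structures assignable as homologues of either kind.
homologue_share = round(100 * (199 + 169) / 443)
print(f"all homologues: {homologue_share}%")
```

The remainder works out to 35 structures (about 8%), and clear plus probable homologues together account for about 83% of the 443, consistent with the figures given in the surrounding text.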

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
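The comparison of EC identifiers at the first three levels can be illustrated with a small helper. EC numbers have four dot-separated fields, and agreement on the first three means the enzymes fall in the same sub-subclass. The function names are hypothetical and the EC numbers in the usage lines are chosen only as examples of two serine endopeptidases; they are not taken from the analysis described above.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return tuple(parts[:levels])


def family_shares_ec(ec_numbers, levels=3):
    """True if all EC numbers in a family agree on the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


family = ["3.4.21.62", "3.4.21.64"]  # illustrative values only
print(family_shares_ec(family))            # same first three fields
print(family_shares_ec(family, levels=4))  # full EC codes differ
```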

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, each part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
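The distance criterion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JAIL implementation: the `Atom` record, the function names and the toy coordinates are hypothetical, atoms are assumed to be already parsed from a coordinate file, and one C-alpha atom per residue is assumed when applying the minimum-size rule.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record; real coordinates would come from a PDB file parser.
Atom = namedtuple("Atom", "chain res_id name x y z")

CUTOFF = 4.5  # distance cutoff in Angstrom, as stated in the text
MIN_CA = 5    # minimum number of C-alpha atoms per interface part (protein chains)


def interface_residues(atoms, chain, partner, cutoff=CUTOFF):
    """Return the residues of `chain` having at least one atom within
    `cutoff` Angstrom of any atom of `partner`."""
    own = [a for a in atoms if a.chain == chain]
    other = [a for a in atoms if a.chain == partner]
    hits = set()
    for a in own:
        if a.res_id in hits:
            continue  # the whole residue already counts as part of the interface
        if any(math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff for b in other):
            hits.add(a.res_id)
    return sorted(hits)


def is_valid_part(residues, min_ca=MIN_CA):
    """Minimum-size rule, assuming one C-alpha atom per residue."""
    return len(residues) >= min_ca


# Toy coordinates (made up for illustration only).
toy_atoms = [
    Atom("A", 1, "CA", 0.0, 0.0, 0.0),
    Atom("A", 2, "CA", 10.0, 0.0, 0.0),
    Atom("B", 1, "CA", 3.0, 0.0, 0.0),
]
contacts = interface_residues(toy_atoms, "A", "B")  # residue 1 is 3.0 Angstrom from chain B
```

The all-against-all distance loop is quadratic in the number of atoms; for large structures a spatial index (grid or k-d tree) would be the usual replacement, but the criterion itself stays the same.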

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a large number of interfaces that are still diverse enough for statistical analysis.
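Why a lower identity threshold gives a more diverse (less redundant) set can be illustrated with a simplified greedy clustering, loosely modelled on the general idea behind CD-HIT (longest sequences become cluster representatives first). The sketch below is an assumption-laden toy: the identity measure is an ungapped, left-aligned comparison rather than a real alignment, and the sequences and function names are made up for illustration.

```python
from __future__ import annotations


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in an ungapped, left-aligned comparison.

    Simplifying assumption: real tools align the sequences first.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_cluster(sequences: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy incremental clustering; returns {representative: [members]}."""
    clusters: dict[str, list[str]] = {}
    # Process the longest sequences first (ties broken by name for reproducibility).
    for name in sorted(sequences, key=lambda n: (-len(sequences[n]), n)):
        for rep in clusters:
            if identity(sequences[name], sequences[rep]) >= threshold:
                clusters[rep].append(name)  # redundant with an existing representative
                break
        else:
            clusters[name] = [name]  # no representative is similar enough: new cluster
    return clusters


# Toy sequences (hypothetical).
toy_seqs = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",  # 90% identical to "a"
    "c": "MKTAYLLLLL",  # 50% identical to "a"
    "d": "GGGGGGGGGG",  # unrelated
}
strict = greedy_cluster(toy_seqs, 0.95)  # only near-identical sequences are merged
loose = greedy_cluster(toy_seqs, 0.50)   # more sequences collapse into one cluster
```

With the 95% setting every toy sequence remains its own representative, whereas the 50% setting keeps only two representatives that are mutually dissimilar, which is the sense in which the lower setting yields higher diversity.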

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB ID, e.g. 1aay

SCOP ID, e.g. d1az0a_

EC number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None

Full-text search: by keyword

SCOP text search: by keyword

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: having identified a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

Recent news

3rd June 2013 genetrainer being launched at LeWeb conference

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.

HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.

SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.

RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support our data-gathering efforts.

STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.

UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).

UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene, annotated with mapping and expression information and cross-references to other sources.

DNA & RNA Databases

Major Sequence Repositories – Human Chromosome Information – Organelle Genome Databases – RNA Databases – Comparative & Phylogenetic Databases – SNPs, Mutations and Variations Databases – Alternative Splicing Databases – Specialized Databases

Major Sequence Repositories

DDBJ DNA databank of Japan

EMBL Maintained by EMBL

GenBank Maintained by NCBI

Human Chromosome Information

Chromosome information is available for chromosomes 1 to 22, X and Y.

Organelle Genome Databases

OGMP Organelle genome megasequencing program

GOBASE An organelle genome database

MitoMap Human mitochondrial genome database


RNA Databases

Rfam RNA family database

RNA base Database of RNA structures

tRNA database Database of tRNAs

tRNA tRNA sequences and genes

sRNA Small RNA database

Comparative & Phylogenetic Databases

COG Phylogenetic classification of proteins

DHMHD Human-mouse homology database

HomoloGene Gene homologies across species

Homophila Human disease to Drosophila gene database

HOVERGEN Database of homologous vertebrate genes

TreeBase A database of phylogenetic knowledge

XREF Cross-referencing with model organisms

SNPs, Mutations & Variations Databases

ALPSbase Database of mutations causing human ALPS

dbSNP Single nucleotide polymorphism database at NCBI

HGVbase Human Genome Variation database

Alternative Splicing Databases

ASAP Alternate splicing analysis tool at UCLA

ASG Alternate splicing gallery

HASDB Human alternative splicing database at UCLA

AsMamDB alternatively spliced genes in human mouse and rat

ASD Alternative splicing database at CSHL

Specialised Databases

ABIM Links to several genomics databases

ACUTS Ancient conserved untranslated sequences

AGSD Animal genome size database

AmiGO The Gene Ontology database

ARGH The acronym database

ASDB Database of alternatively spliced genes

BACPAC BAC and PAC genomic DNA library info

BBID Biological Biochemical image database

Cardiac gene database

CHLC Genetic markers on chromosomes

COGENT Complete genome tracking database

COMPEL Composite regulatory elements in eukaryotes

CUTG Codon usage database

dbEST Database of expressed sequences or mRNA

dbGSS Genome survey sequence database

dbSTS Sequence tagged sites (STS)


DBTSS Database of transcriptional start sites

DOGS Database of genome sizes

EID The exon-intron database ndash Harvard

Exon-Intron Exon-Intron database ndash Singapore

EPD Eukaryotic promoter database

FlyTrap HTML based gene expression database

GDB The genome database

GenLink Resources for human genetic and telomere research

GeneKnockouts Gene knockout information

GENOTK Human cDNA database

GEO Gene expression omnibus NCBI

GOLD Information on genome projects around the world

GSDB The Genome Sequence DataBase

HGI TIGR human gene index

HTGS High-throughput genomic sequences at NCBI

IMAGE The largest collection of DNA sequences clones

IMGT The international ImMunoGeneTics information system

IPCN Index to Plant Chromosome Numbers database

LocusLink Single query interface to sequence and genetic loci

TelDB The telomere database

MitoDat Mitochondrial nuclear genes

Mouse EST NIA mouse cDNA project

MPSS Searchable databases of several species

NDB Nucleic acid database

NEDO Human cDNA sequence database

NPD Nuclear protein database

Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute

PLACE Database of plant cis-acting regulatory DNA elements

RDP Ribosomal database project

RDB Receptor database at NIHS Japan

Refseq The NCBI reference sequence project

RHdb Radiation hybrid physical map of chromosomes

SHIGAN SHared Information of GENetic resources Japan

SpliceDB Canonical and non-canonical splice site sequences

STACK Consensus human EST database

TAED The adaptive evolution database

TIGR Curated databases of microbes plants and humans

TRANSFAC The Transcription Factor Database

TRRD Transcription Regulatory region database

UniGene Cluster of sequences for unique genes at NCBI

UniSTS Nonredundant collection of STSs

Protein Databases


Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others

Protein Sequence Databases

Antibodies Sequence and Structure

BRENDA Enzyme database

CD Antigens Database of CD antigens

dbCFC Cytokine family database

Histones Histone sequence database

HPRD Human protein reference database

InterPro Integrated documentation resources for protein families

iProClass An integrated protein classification database

KIND A non-redundant protein sequence database

MHCPEP Database of MHC binding peptides

MIPS Munich information centre for protein sequences

PIR Annotated and non-redundant protein sequence database

PIR-ALN Curated database of protein sequence alignments

PIR-NREF PIR nonredundant reference protein database

PMD Protein mutant database

PRF Protein research foundation Japan

ProClass Non-redundant protein database

ProtoMap Hierarchical classification of swissprot proteins

REBASE Restriction enzyme database

RefSeq Reference sequence database at NCBI

SwissProt Curated protein sequence database

SPTR Comprehensive protein sequence database

Transfac Transcription factor database

TrEMBL Annotated translations of EMBL nucleotide sequences

Tumor gene database Genes with cancer-causing mutations

WD repeats WD-repeat family of proteins

Protein Structure Databases

Cath Protein structure classification

HIV Protease HIV protease database 3D structure

PDB 3-D macromolecular structure data

PSI Protein structure initiative

S2F Structure to function project

Scop Structural Classification of Proteins

Protein Domains, Motifs & Signatures

BLOCKS Multiply aligned segments of conserved protein regions

CDD Conserved domain database and search service

DOMO Homologous protein domain families

Pfam Database of protein domains and HMMs


ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia

Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – Cyanobacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the www

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC


ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germ plasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated Mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)


Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyanobacteria (blue-green algae)

Cyano Bacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)


Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics


Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey


Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

CDD

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
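To make the idea of scanning a query against a position-specific scoring matrix concrete, here is a minimal sketch. It is not RPS-BLAST: it only slides an ungapped window along the query and sums per-position scores, whereas the real program uses BLAST-style heuristics and alignment statistics. The three-column matrix, the default score for unlisted residues and the function name are all made-up assumptions.

```python
def best_pssm_hit(query, pssm, default=-4):
    """Slide an ungapped window over `query` and return (start, score) of the
    best-scoring window, or (None, None) if the query is shorter than the PSSM.

    `pssm` is a list of columns; each column maps a residue to its score.
    Residues missing from a column receive `default` (an assumed penalty).
    """
    width = len(pssm)
    best_start, best_score = None, None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best_score is None or score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical three-column PSSM favouring the motif G-[KR]-[ST].
toy_pssm = [
    {"G": 6, "A": 1},
    {"K": 5, "R": 2},
    {"S": 4, "T": 3},
]
hit = best_pssm_hit("MAGKTLL", toy_pssm)  # best window is "GKT" at offset 2
```

Because each position has its own score column, a residue that is tolerated at one position of the domain can be penalised at another, which is what distinguishes a PSSM from a single substitution matrix applied uniformly.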

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.

CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
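The notion of grouping and scoring proteins by domain architecture can be sketched as follows. This is an illustration only: the scoring used here (number of distinct domains shared with the query) is an assumption chosen for simplicity and is not CDART's actual scoring scheme, and the protein names are hypothetical.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Group protein names by architecture, i.e. the ordered tuple of their domains."""
    groups = defaultdict(list)
    for name, domains in proteins.items():
        groups[tuple(domains)].append(name)
    return dict(groups)


def shared_domains(arch_a, arch_b):
    """Assumed similarity score: number of distinct domains present in both."""
    return len(set(arch_a) & set(arch_b))


def rank_architectures(query, proteins):
    """Return (architecture, proteins, score) tuples, best score first,
    omitting architectures that share no domain with the query."""
    groups = group_by_architecture(proteins)
    ranked = sorted(groups.items(),
                    key=lambda item: (-shared_domains(query, item[0]), item[0]))
    return [(arch, sorted(names), shared_domains(query, arch))
            for arch, names in ranked
            if shared_domains(query, arch) > 0]


# Hypothetical proteins annotated with ordered domain lists.
toy_proteins = {
    "p1": ["SH3", "SH2", "Kinase"],
    "p2": ["SH3", "SH2", "Kinase"],
    "p3": ["SH2", "Kinase"],
    "p4": ["PDZ"],
}
relatives = rank_architectures(["SH3", "SH2", "Kinase"], toy_proteins)
```

Working on domain names rather than residues is what makes such a comparison cheap: two proteins can be related through a shared domain even when their overall sequences have diverged beyond direct detection.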

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI-curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14,831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches

View a Pfam family: View Pfam family annotation and alignments

View a clan: See groups of related families

View a sequence: Look at the domain organisation of a protein sequence

View a structure: Find the domains on a PDB structure

Keyword search: Query Pfam by keywords

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can browse lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website.

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
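As a minimal sketch of how the two gathering cutoffs could be applied, the Python below keeps a match only when its sequence score reaches the sequence cutoff and its domain score reaches the domain cutoff. The `Hit` record, the function name and every numeric value are hypothetical and chosen only for illustration; real Pfam families carry their own curated GA values.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    sequence_id: str
    sequence_score: float  # bit score of the whole sequence against the HMM
    domain_score: float    # bit score of one domain match on that sequence


def passes_gathering_threshold(hit, ga_sequence, ga_domain):
    """Return True if the hit meets both (assumed) GA cutoffs."""
    return hit.sequence_score >= ga_sequence and hit.domain_score >= ga_domain


# Illustrative hits and cutoffs (made-up numbers).
hits = [
    Hit("seqA", 45.0, 30.0),  # passes both cutoffs
    Hit("seqB", 45.0, 18.0),  # domain score too low
    Hit("seqC", 20.0, 20.0),  # sequence score too low
]
kept = [h.sequence_id for h in hits if passes_gathering_threshold(h, 25.0, 22.0)]
print(kept)
```

Only hits that satisfy both cutoffs would enter the full alignment under this reading of the definition.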

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

A HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

A HMM based hand curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with * being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to a HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
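The trusted cutoff and noise cutoff both follow from where the gathering threshold falls in a list of match scores. The sketch below illustrates this with made-up bit scores; the function name and the data are hypothetical, and a real Pfam entry records separate sequence and domain values for each cutoff.

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Return (TC, NC) for a list of bit scores and a gathering threshold.

    TC is the lowest score that is in the full alignment (score >= GA);
    NC is the highest score that is not in it (score < GA).
    Either value is None when the corresponding group is empty.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    trusted = min(included) if included else None
    noise = max(excluded) if excluded else None
    return trusted, noise


# Illustrative bit scores only.
tc, nc = trusted_and_noise_cutoffs([52.1, 31.4, 27.0, 24.9, 12.3], gathering_threshold=25.0)
print(tc, nc)
```

By construction the noise cutoff lies below the gathering threshold and the trusted cutoff lies at or above it.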

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember that we are exploring a limited set of genomes, though.

Different color schemes are used to easily identify the mode we are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
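The definition of the bits score translates directly into a one-line calculation. The sketch below assumes the two likelihoods are already available as plain numbers; the function name and the example values are illustrative and not taken from HMMer or BLAST.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Log (base 2) of the ratio between the two likelihoods."""
    if likelihood_homologue <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homology model
# than under the random model scores 3 bits.
print(bits_score(8.0, 1.0))
# A sequence that is 4 times more likely under the random model scores -2 bits.
print(bits_score(1.0, 4.0))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.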

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
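The difference between domain composition and domain organisation can be made concrete with a small sketch. It assumes each protein is represented simply as a list of domain names in sequence order; the function names and the example architectures are illustrative and not SMART's own implementation.

```python
def same_domain_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_domain_organisation(query, protein):
    """True if all query domains occur in the protein in the same order.

    Additional domains may appear anywhere in the protein, so this is an
    ordered-subsequence test.
    """
    remaining = iter(protein)
    # `domain in remaining` advances the iterator past the first match,
    # which enforces the left-to-right order.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
with_extra_domain = ["SH3", "SH2", "PH", "TyrKc"]
reordered = ["TyrKc", "SH2", "SH3"]

print(same_domain_composition(query, with_extra_domain), same_domain_organisation(query, with_extra_domain))
print(same_domain_composition(query, reordered), same_domain_organisation(query, reordered))
```

A protein with the query domains in a different order matches by composition but not by organisation.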

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data. SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X, expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
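The consensus rule described above can be sketched in a few lines. The example assumes each model position is given as a dictionary of residue probabilities; the toy model below lists only a few residues per position, whereas a real protein HMM has a probability for each of the 20 amino acids. The function name and all values are hypothetical.

```python
def hmm_consensus(position_probabilities, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    The most probable residue is reported at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    consensus = []
    for probabilities in position_probabilities:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(consensus)


# Toy three-position model (made-up probabilities).
toy_model = [
    {"G": 0.90, "A": 0.10},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.60, "R": 0.40},
]
print(hmm_consensus(toy_model))
```

The middle position is written in lower case because no single residue there exceeds the 0.5 threshold.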

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
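The glossary defines both an E-value and a P-value. For BLAST-style statistics the two are commonly related by assuming that the number of chance hits follows a Poisson distribution, which gives P = 1 - exp(-E). That relation is brought in here as an assumption for illustration; the SMART text itself does not state it, and the function names are hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """P(at least one chance hit) under the assumed Poisson model."""
    return -math.expm1(-e_value)  # numerically stable 1 - exp(-E)


def e_value_from_p_value(p_value):
    """Inverse of the relation above, valid for 0 <= P < 1."""
    return -math.log1p(-p_value)  # numerically stable -ln(1 - P)


for e in (0.001, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two measures are nearly equal, while for large E-values the P-value saturates towards 1.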

PDB: protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set by prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
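The selection rule for the secondary literature can be sketched as a simple counting step. The example assumes the neighbouring papers have already been retrieved and are supplied as a dictionary keyed by the hand-selected paper; the identifiers, the function name and the short neighbour lists are hypothetical (the real procedure retrieves 100 neighbours per paper).

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_referrers=3):
    """Return neighbours reached from more than two hand-selected papers.

    neighbours_by_paper maps each hand-selected paper to the list of its
    neighbouring papers. "More than two" is expressed as min_referrers=3.
    """
    referrer_counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour once per original paper, even if listed twice.
        referrer_counts.update(set(neighbours))
    return sorted(p for p, n in referrer_counts.items() if n >= min_referrers)


# Toy input with made-up identifiers.
neighbours = {
    "paper1": ["n_a", "n_b", "n_c"],
    "paper2": ["n_a", "n_b"],
    "paper3": ["n_a", "n_c"],
}
print(secondary_literature(neighbours))
```

Only the neighbour shared by all three hand-selected papers clears the threshold in this toy input.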

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences.

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
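Two steps of the redundancy-reduction procedure are easy to illustrate: collapsing 100% identical proteins while keeping all their IDs, and choosing the longest member of each cluster as its representative. The sketch below does not run CD-HIT; it assumes the clusters have already been computed by such a tool and are passed in as lists of IDs. All names, sequences and clusters are made up for illustration.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Group protein IDs that share exactly the same sequence.

    proteins maps protein ID -> sequence; the result maps each distinct
    sequence to all IDs that carry it, so no ID is lost.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins.items():
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters, proteins):
    """Return the longest member of each cluster (clusters: lists of IDs)."""
    return [max(cluster, key=lambda pid: len(proteins[pid])) for cluster in clusters]


# Toy data: P1 and P2 are identical; P3 is a longer relative of P1.
proteins = {"P1": "MKTAYIAK", "P2": "MKTAYIAK", "P3": "MKTAYIAKQR", "P4": "MSLL"}
collapsed = collapse_identical(proteins)
# Hypothetical clusters, as an external clustering tool might report them.
representatives = pick_representatives([["P1", "P3"], ["P4"]], proteins)
print(collapsed)
print(representatives)
```

A single protein that belongs to no larger cluster is simply its own representative, matching the last step of the list above.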

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
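The last-common-ancestor step can be sketched as a longest-common-prefix computation over lineages. The example assumes each organism's lineage is supplied as a list of taxon names ordered from the root downwards; the lineages below are heavily simplified and the function name is hypothetical, so this is an illustration of the idea rather than SMART's implementation.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-first lineages, or None."""
    if not lineages:
        return None
    ancestor = None
    # Walk down from the root, one rank at a time, while all lineages agree.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        ancestor = taxa_at_rank[0]
    return ancestor


# Simplified lineages of organisms that contain the same architecture.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Ascomycota"]

print(last_common_ancestor([human, fly]))
print(last_common_ancestor([human, fly, yeast]))
```

Adding an organism from a more distant lineage pushes the inferred point of origin closer to the root.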

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and links to the relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
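A 1:1 reciprocal best match between two genomes can be sketched as follows. The example assumes the best hit of every protein in the other genome has already been determined (for instance from a similarity search) and is supplied as two dictionaries; the protein identifiers and the function name are made up for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Toy best-hit tables between genome A and genome B.
best_a_to_b = {"A1": "B1", "A2": "B2", "A3": "B2"}
best_b_to_a = {"B1": "A1", "B2": "A3"}

pairs = reciprocal_best_hits(best_a_to_b, best_b_to_a)
print(pairs)
```

A2 is dropped because its best hit B2 points back to A3 rather than to A2, so the relationship is not reciprocal.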

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).

A SMART Toolbar for the Mozilla web browser is also available.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.


Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
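The behaviour described for selective SMART (AND-combined domain names, case-insensitive, repeated names meaning multiple copies) can be sketched as follows. This is an illustration of the matching logic only, under the assumption that a protein is represented as a plain list of domain and feature names; `matches_architecture` is a hypothetical helper, not part of SMART, and the real query language may support more than AND.

```python
import re
from collections import Counter


def matches_architecture(query, domains):
    """True if the protein carries every queried domain at least as often as it is named."""
    terms = re.split(r"\s+AND\s+", query.strip())
    required = Counter(term.lower() for term in terms)
    present = Counter(name.lower() for name in domains)
    return all(present[name] >= count for name, count in required.items())


# Hypothetical proteins, described only by their domain/feature names.
three_sh3 = ["SH3", "SH3", "SH3", "TyrKc"]
two_sh3 = ["SH3", "SH3", "TyrKc"]

print(matches_architecture("Sh3 AND sH3 AND sh3", three_sh3))  # needs three SH3 copies
print(matches_architecture("Sh3 AND sH3 AND sh3", two_sh3))
print(matches_architecture("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))
```

Counting names rather than testing set membership is what lets a repeated term stand for "at least this many copies".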

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.


GENERAL INTRODUCTION


The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only ∼2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS, Version …

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains …

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes …

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
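To make the nine-level code concrete, here is a minimal sketch of how such an identifier could be split into its levels. It assumes the code is written as nine dot-separated integers in C.A.T.H.S.O.L.I.D order; that notation, the `CathSolid` type and the example code are assumptions for illustration (the first four numbers are a plausible C.A.T.H prefix, the SOLID numbers are invented).

```python
from typing import NamedTuple


class CathSolid(NamedTuple):
    """The nine CATHSOLID levels, in hierarchy order."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int
    s35: int   # sequence cluster at 35% identity
    o60: int   # sequence cluster at 60% identity
    l95: int   # sequence cluster at 95% identity
    i100: int  # sequence cluster at 100% identity
    d: int     # counter within the I-level


def parse_cathsolid(code):
    """Split a dot-separated nine-level code into named integer levels."""
    parts = code.strip().split(".")
    if len(parts) != 9:
        raise ValueError(f"expected 9 levels, got {len(parts)}")
    return CathSolid(*(int(part) for part in parts))


def superfamily_id(levels):
    """Return the four-level C.A.T.H prefix shared by all members of a superfamily."""
    return ".".join(str(value) for value in levels[:4])


example = parse_cathsolid("3.40.50.200.1.1.1.1.1")  # hypothetical domain code
print(example.topology, superfamily_id(example))
```

Two domains share a superfamily when their first four levels agree, and the remaining five levels only refine that grouping by sequence identity.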


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
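The decision between automatic and manual chopping can be sketched as a simple threshold check. This is a minimal illustration that uses only the four criteria named in the text; the text ends its list with "etc.", so the real protocol applies further checks that are not modelled here. The `InheritedChopping` record and `can_autochop` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Quality measures for boundaries inherited from one previously chopped chain."""
    ssap_score: float             # SSAP structural similarity score (0-100)
    sequence_identity: float      # percent identity to the chopped chain
    rmsd: float                   # RMSD in angstroms
    longest_end_extension: int    # residues


def can_autochop(result):
    """True if the inherited boundaries pass the four thresholds quoted in the text."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


close_relative = InheritedChopping(92.5, 97.0, 1.2, 3)   # illustrative values
borderline = InheritedChopping(85.0, 80.0, 2.0, 4)       # identity is not above 80%

print(can_autochop(close_relative))
print(can_autochop(borderline))
```

A result that fails any single threshold is not discarded; per the text it is passed on as supporting evidence for manual boundary assignment.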


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
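The hit-selection rule described above (an E-value cut-off combined with a minimum residue overlap with the CATH domain) can be sketched as a small filter. This is an assumption-laden illustration: the function name is hypothetical, overlap is taken to mean aligned residues divided by domain length, and the thresholds are left as parameters whose defaults follow the reading of the text (E-value below 0.01, 60% overlap).

```python
def is_sequence_relative(e_value, aligned_residues, domain_length,
                         max_e_value=0.01, min_overlap=0.60):
    """Accept an HMM hit if it is significant and covers enough of the domain."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


# Illustrative hits against a hypothetical 120-residue domain model.
hits = [
    ("seqA", 1e-5, 90),   # significant, 75% overlap
    ("seqB", 0.05, 90),   # not significant
    ("seqC", 1e-5, 60),   # significant, only 50% overlap
]
relatives = [name for name, e_value, aligned in hits
             if is_sequence_relative(e_value, aligned, 120)]
print(relatives)
```

Requiring both conditions keeps short but highly significant partial matches from being counted as full domain relatives.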

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups) …


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
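The SSAP score ranges quoted above can be read as rough tiers. The sketch below only restates those ranges in code; the function name and tier labels are hypothetical, the cut-offs are the "generally above 80" and "generally above 70" guides from the text rather than strict rules, and actual classification also weighs sequence and functional evidence.

```python
def ssap_tier(score):
    """Place an SSAP score (0-100) into the rough tiers described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "typical of a shared fold (Topology level)"
    return "below the usual fold-level range"


for score in (100, 84.5, 73, 55):
    print(score, "->", ssap_tier(score))
```

A score in the 70 to 80 band on its own does not establish homology, since the text notes that such similarity may reflect convergence on a limited set of folds.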

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α−β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
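The four levels of the hierarchy map naturally onto a four-part identifier. As a minimal sketch, assuming CATH ids are written as four dot-separated integers (class, architecture, topology, homologous superfamily) and that class numbers 1 to 4 stand for mainly-α, mainly-β, α−β and few secondary structures, an id can be split into its levels as follows. The function and type names are illustrative only and are not part of any CATH software.

```python
from typing import NamedTuple

# Assumed numbering of the top-level classes (an assumption of this sketch).
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    klass: int          # (C)lass
    architecture: int   # (A)rchitecture
    topology: int       # (T)opology or fold
    homologous: int     # (H)omologous superfamily


def parse_cath_id(text: str) -> CathId:
    """Split a dotted CATH id such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def same_level(a: CathId, b: CathId) -> str:
    """Return the deepest level (C, A, T or H) shared by two domains, or '' if none."""
    shared = ""
    for level, x, y in zip("CATH", a, b):
        if x != y:
            break
        shared = level
    return shared
```

Two domains that agree down to the T level but not the H level share a fold without being placed in the same homologous superfamily, which is the distinction the hierarchy is designed to capture.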

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α−β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no


detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
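The categories used in this analysis can be written as a small decision rule. The sketch below is illustrative: the thresholds (SSAP ≥80, ≥20% sequence identity) come from the text, while the function name, the boolean inputs and the order of the checks are assumptions made for the example. It also re-derives the quoted percentages from the counts.

```python
def classify_relationship(same_fold: bool, ssap_score: float,
                          seq_identity_pct: float,
                          functional_or_profile_evidence: bool) -> str:
    """Assign a new domain to one of the four categories described above."""
    if not same_fold:
        return "novel fold"
    if ssap_score >= 80 and seq_identity_pct >= 20:
        return "clear homologue"
    if functional_or_profile_evidence:
        # Identity below the threshold, but shared function and/or a significant
        # score from a distant-homologue search such as PSI-BLAST.
        return "probable homologue"
    return "analogue"


# Counts quoted in the text for the 443 structures with new sequences.
TOTAL = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
counts["novel fold"] = TOTAL - sum(counts.values())  # the remainder

# Whole-number percentages of the 443 structures in each category.
percentages = {name: round(100 * n / TOTAL) for name, n in counts.items()}
```

The remainder works out to 35 structures, which is consistent with the 8% of novel folds quoted above.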


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
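Checking whether a family is functionally consistent at a given EC depth is straightforward. In this sketch, EC numbers are assumed to be dotted strings such as '3.4.21.62'; the function name and the example family are hypothetical and are not taken from CATH.

```python
def shares_ec_prefix(ec_numbers: list[str], depth: int = 3) -> bool:
    """True if every EC number agrees on its first `depth` identifiers."""
    prefixes = {tuple(ec.split(".")[:depth]) for ec in ec_numbers}
    return len(prefixes) <= 1


# Hypothetical family: two EC 3.4.21 entries that differ only in the fourth identifier.
family = ["3.4.21.62", "3.4.21.64"]
```

With `depth=3` this family counts as conserved; with `depth=4` it does not, which mirrors the difference between "first three EC identifiers the same" and "a single EC code".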

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this


will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
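As a rough worked example of this extrapolation (a back-of-the-envelope sketch, not a result from the analysis), the two ranges can be multiplied to estimate what share of a whole genome would gain a homologue assignment from structural data alone. Reading the ranges as 54–70% and 80–90%, and pairing low with low and high with high, are assumptions of the example.

```python
def combined_range(unmatched_pct: tuple[float, float],
                   structural_pct: tuple[float, float]) -> tuple[float, float]:
    """Multiply two percentage ranges end to end (low with low, high with high)."""
    low = unmatched_pct[0] * structural_pct[0] / 100
    high = unmatched_pct[1] * structural_pct[1] / 100
    return (low, high)


# Share of all sequences in a genome newly assigned via structure alone.
low, high = combined_range((54, 70), (80, 90))
```

Under these assumptions, roughly 43% to 63% of the sequences in a genome would be assigned to a known family only once structural data are available.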

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
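The distance criterion for protein chains can be sketched in a few lines. This is an illustration, not JAIL's implementation: atoms are assumed to be given as (residue id, x, y, z) tuples, the 4.5 Å cutoff and the minimum of five residues per side are taken from the definition above (each residue is assumed to contribute one C-alpha atom), and all names are hypothetical.

```python
import math

Atom = tuple[str, float, float, float]  # (residue id, x, y, z), coordinates in Å


def interface_residues(chain_a: list[Atom], chain_b: list[Atom],
                       cutoff: float = 4.5) -> tuple[set[str], set[str]]:
    """Residues of each chain with at least one atom within `cutoff` of the other chain."""
    side_a: set[str] = set()
    side_b: set[str] = set()
    for res_a, *xyz_a in chain_a:
        for res_b, *xyz_b in chain_b:
            # Euclidean distance between the two atoms.
            if math.dist(xyz_a, xyz_b) <= cutoff:
                side_a.add(res_a)
                side_b.add(res_b)
    return side_a, side_b


def is_interface(side_a: set[str], side_b: set[str], min_residues: int = 5) -> bool:
    """Apply the minimum-size rule: each side needs at least `min_residues` residues."""
    return len(side_a) >= min_residues and len(side_b) >= min_residues
```

The all-against-all loop is quadratic in the number of atoms; it keeps the sketch short, whereas a large-scale calculation would more plausibly use a spatial index or grid.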

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid


position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 311

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

None

Fulltext search (by keyword)

SCOP text search (by keyword)


Consider only interfaces having the following location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.


Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
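Some of the per-genome statistics that SUPERFAMILY reports can be computed from a plain table of domain assignments. The sketch below assumes a hypothetical in-memory layout, a mapping from protein id to the list of superfamily ids assigned to it (empty if none); it does not use the SUPERFAMILY download format or database schema, and the ids are made up.

```python
from collections import Counter


def genome_statistics(assignments: dict[str, list[str]]) -> dict[str, float]:
    """Summarise superfamily assignments for one genome."""
    n_sequences = len(assignments)
    n_assigned = sum(1 for domains in assignments.values() if domains)
    # Count how many domains were assigned to each superfamily.
    superfamilies = Counter(sf for domains in assignments.values() for sf in domains)
    return {
        "sequences": n_sequences,
        "sequences_with_assignment": n_assigned,
        "percent_with_assignment": 100 * n_assigned / n_sequences if n_sequences else 0.0,
        "domains_assigned": sum(superfamilies.values()),
        "superfamilies_assigned": len(superfamilies),
    }


# Made-up example: three proteins, one without any assignment.
example = {
    "protein_1": ["sf_A", "sf_B"],
    "protein_2": ["sf_A"],
    "protein_3": [],
}
stats = genome_statistics(example)
```

Comparing such per-genome counts across genomes is one simple starting point for spotting over- or under-represented superfamilies, although the method SUPERFAMILY itself uses is not described here.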


RNA Databases

Rfam RNA family database

RNA base Database of RNA structures

tRNA database Database of tRNAs

tRNA tRNA sequences and genes

sRNA Small RNA database

Comparative & Phylogenetic Databases

COG Phylogenetic classification of proteins

DHMHD Human-mouse homology database

HomoloGene Gene homologies across species

Homophila Human disease to Drosophila gene database

HOVERGEN Database of homologous vertebrate genes

TreeBase A database of phylogenetic knowledge

XREF Cross-referencing with model organisms

SNPs, Mutations & Variations Databases

ALPSbase Database of mutations causing human ALPS

dbSNP Single nucleotide polymorphism database at NCBI

HGVbase Human Genome Variation database

Alternative Splicing Databases

ASAP Alternate splicing analysis tool at UCLA

ASG Alternate splicing gallery

HASDB Human alternative splicing database at UCLA

AsMamDB Alternatively spliced genes in human, mouse and rat

ASD Alternative splicing database at CSHL

Specialised Databases

ABIM Links to several genomics databases

ACUTS Ancient conserved untranslated sequences

AGSD Animal genome size database

AmiGO The Gene Ontology database

ARGH The acronym database

ASDB Database of alternatively spliced genes

BACPAC BAC and PAC genomic DNA library info

BBID Biological Biochemical image database

Cardiac gene database

CHLC Genetic markers on chromosomes

COGENT Complete genome tracking database

COMPEL Composite regulatory elements in eukaryotes

CUTG Codon usage database

dbEST Database of expressed sequences or mRNA

dbGSS Genome survey sequence database

dbSTS Sequence tagged sites (STS)


DBTSS Database of transcriptional start sites

DOGS Database of genome sizes

EID The exon-intron database – Harvard

Exon-Intron Exon-Intron database – Singapore

EPD Eukaryotic promoter database

FlyTrap HTML based gene expression database

GDB The genome database

GenLink Resources for human genetic and telomere research

GeneKnockouts Gene knockout information

GENOTK Human cDNA database

GEO Gene expression omnibus NCBI

GOLD Information on genome projects around the world

GSDB The Genome Sequence DataBase

HGI TIGR human gene index

HTGS High-throughput genomic sequences at NCBI

IMAGE The largest collection of DNA sequences clones

IMGT The international ImMunoGeneTics information system

IPCN Index to Plant Chromosome Numbers database

LocusLink Single query interface to sequence and genetic loci

TelDB The telomere database

MitoDat Mitochondrial nuclear genes

Mouse EST NIA mouse cDNA project

MPSS Searchable databases of several species

NDB Nucleic acid database

NEDO Human cDNA sequence database

NPD Nuclear protein database

Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute

PLACE Database of plant cis-acting regulatory DNA elements

RDP Ribosomal database project

RDB Receptor database at NIHS Japan

Refseq The NCBI reference sequence project

RHdb Radiation hybrid physical map of chromosomes

SHIGAN SHared Information of GENetic resources Japan

SpliceDB Canonical and non-canonical splice site sequences

STACK Consensus human EST database

TAED The adaptive evolution database

TIGR Curated databases of microbes plants and humans

TRANSFAC The Transcription Factor Database

TRRD Transcription Regulatory region database

UniGene Cluster of sequences for unique genes at NCBI

UniSTS Non-redundant collection of STS

Protein Databases


Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others

Protein Sequence Databases

Antibodies Sequence and Structure

BRENDA Enzyme database

CD Antigens Database of CD antigens

dbCFC Cytokine family database

Histons Histone sequence database

HPRD Human protein reference database

InterPro Integrated documentation resources for protein families

iProClass An integrated protein classification database

KIND A non-redundant protein sequence database

MHCPEP Database of MHC binding peptides

MIPS Munich information centre for protein sequences

PIR Annotated and non-redundant protein sequence database

PIR-ALN Curated database of protein sequence alignments

PIR-NREF PIR non-redundant reference protein database

PMD Protein mutant database

PRF Protein research foundation Japan

ProClass Non-redundant protein database

ProtoMap Hierarchical classification of SwissProt proteins

REBASE Restriction enzyme database

RefSeq Reference sequence database at NCBI

SwissProt Curated protein sequence database

SPTR Comprehensive protein sequence database

Transfac Transcription factor database

TrEMBL Annotated translations of EMBL nucleotide sequences

Tumor gene database Genes with cancer-causing mutations

WD repeats WD-repeat family of proteins

Protein Structure Databases

Cath Protein structure classification

HIV Protease HIV protease database 3D structure

PDB 3-D macromolecular structure data

PSI Protein structure initiative

S2F Structure to function project

Scop Structural Classification of Proteins

Protein Domains, Motifs & Signatures

BLOCKS Multiple aligned segments of conserved protein regions

CDD Conserved domain database and search service

DOMO Homologous protein domain families

Pfam Database of protein domains and HMMs


ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, bipolar disorder or schizophrenia

Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – CyanoBacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the www

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC


ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germ plasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated Mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)


Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyano Bacteria (Blue green algae)

Cyano Bacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)


Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics


Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey


Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often: what is found as an independently folding unit of a polypeptide chain usually also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD Curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
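The idea of scanning a query with a position-specific scoring matrix can be illustrated with a minimal Python sketch. This is not RPS-BLAST itself: it is an ungapped sliding-window scan, and the function name `best_pssm_hit`, the toy matrix and the default penalty of -1 for unlisted residues are all assumptions made for illustration.

```python
def best_pssm_hit(pssm, query, default=-1):
    """Slide a PSSM along the query and return (best_score, start_index).

    pssm is a list with one dict per model position, mapping residue -> score.
    Residues missing from a column receive the (assumed) default penalty.
    Returns None if the query is shorter than the matrix.
    """
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        # Sum the column score of each residue in the current window.
        score = sum(pssm[i].get(query[start + i], default) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy three-column matrix (hypothetical values, not taken from CDD).
toy_pssm = [
    {"G": 3, "A": 1},
    {"K": 2, "R": 2},
    {"S": 3, "T": 2},
]

print(best_pssm_hit(toy_pssm, "MAGKSTL"))
```

Because each column carries its own scores, the same residue can be rewarded at one position and penalised at another, which is what distinguishes a PSSM from a single substitution matrix.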

CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service or the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI-curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motifs: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: analyze your protein sequence for Pfam matches

View a Pfam family: view Pfam family annotation and alignments

View a clan: see groups of related families

View a sequence: look at the domain organisation of a protein sequence

View a structure: find the domains on a PDB structure

Keyword search: query Pfam by keywords

Jump to: enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can browse lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
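How the two GA cutoffs might be applied can be sketched in Python. This is an illustrative reading, not Pfam's actual pipeline: it assumes a domain hit is kept only when the whole-sequence score reaches the sequence cutoff and the individual domain score reaches the domain cutoff. The function name and the example scores are hypothetical.

```python
def domains_passing_gathering(sequence_score, domain_scores, ga_sequence, ga_domain):
    """Return the domain scores that would enter the full alignment.

    Assumption: a hit is accepted only if the sequence-level bit score meets
    the sequence cutoff AND the domain-level bit score meets the domain cutoff.
    """
    if sequence_score < ga_sequence:
        # The sequence as a whole fails, so none of its domains are kept.
        return []
    return [score for score in domain_scores if score >= ga_domain]


# Hypothetical bit scores for one protein against one Pfam HMM.
kept = domains_passing_gathering(
    sequence_score=30.0,
    domain_scores=[18.0, 12.0],
    ga_sequence=25.0,
    ga_domain=15.0,
)
print(kept)
```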

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.

Posterior probability


HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with "*" being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
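The relationship between the gathering threshold, the trusted cutoff and the noise cutoff can be made concrete with a small Python sketch. The function name and the list of bit scores are hypothetical; the logic simply follows the glossary definitions (TC is the lowest score inside the full alignment, NC is the highest score outside it).

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Return (TC, NC) for a list of match bit scores.

    Matches scoring at or above the gathering threshold are treated as
    members of the full alignment; everything else is outside it.
    A cutoff is None when its group is empty.
    """
    included = [s for s in bit_scores if s >= gathering_threshold]
    excluded = [s for s in bit_scores if s < gathering_threshold]
    trusted = min(included) if included else None  # lowest score in the alignment
    noise = max(excluded) if excluded else None    # highest score left out
    return trusted, noise


# Hypothetical search results for one family with GA = 25.0 bits.
print(trusted_and_noise_cutoffs([45.2, 31.0, 27.5, 22.1, 10.4], 25.0))
```

By construction the gathering threshold always lies between the two values: NC < GA <= TC.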

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.


The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that in Genomic mode you are exploring a limited set of genomes.

Different color schemes are used to easily identify the mode we are in

Normal mode Genomic mode

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
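The definition above amounts to a one-line calculation, sketched here in Python. The function name and the two example likelihoods are hypothetical; real tools compute these likelihoods from the model and a background distribution.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Bits score = log2 of the ratio between the two likelihoods."""
    return math.log2(likelihood_homologue / likelihood_random)


# If the homology model explains the sequence 8 times better than the
# random model, the score is log2(8) = 3 bits.
print(bits_score(0.5, 0.0625))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology, and a negative score means the random model explains the sequence better.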

BLAST Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains as the query in the same order (Additional domains are allowed)
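The difference between domain composition and domain organisation can be shown with a short Python sketch. The function names and the example domain lists are hypothetical; the logic follows the two definitions above (composition ignores order, organisation requires the query domains to appear in the same order, with extra domains allowed anywhere).

```python
def same_composition(query_domains, target_domains):
    """True if the target has at least one copy of every query domain."""
    return set(query_domains) <= set(target_domains)


def same_organisation(query_domains, target_domains):
    """True if the query domains occur in the target in the same order.

    Additional domains in the target are allowed, so this is a
    subsequence test: each query domain is searched for only in the part
    of the target that follows the previous match.
    """
    remaining = iter(target_domains)
    return all(domain in remaining for domain in query_domains)


query = ["SH3", "SH2", "TyrKc"]
with_extra = ["SH3", "SH2", "PH", "TyrKc"]   # extra PH domain, same order
shuffled = ["SH2", "SH3", "TyrKc"]           # same domains, different order

print(same_composition(query, with_extra), same_organisation(query, with_extra))
print(same_composition(query, shuffled), same_organisation(query, shuffled))
```

Every protein with the same organisation also has the same composition, but not the other way round, so an organisation query returns a subset of the composition query's results.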

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
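A minimal Python sketch of how such a consensus line could be derived is given below. It assumes each model position is available as a dictionary of emission probabilities; the function name, the data layout and the toy probabilities are hypothetical and do not reflect HMMer's internal format.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    emissions: list of dicts mapping amino acid -> probability.
    The most probable residue is written in upper case when its
    probability exceeds the threshold, otherwise in lower case.
    """
    letters = []
    for column in emissions:
        residue, probability = max(column.items(), key=lambda item: item[1])
        if probability > threshold:
            letters.append(residue.upper())
        else:
            letters.append(residue.lower())
    return "".join(letters)


# Toy three-position model (made-up probabilities).
toy_model = [
    {"G": 0.9, "A": 0.1},
    {"K": 0.4, "R": 0.35, "Q": 0.25},
    {"S": 0.6, "T": 0.4},
]

print(hmm_consensus(toy_model))
```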

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
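The notion of non-redundancy used here (no two identical sequences, but fragments of the same gene may remain) can be illustrated with a short Python sketch. The function name and the records are hypothetical, and this is not how SMART's database is actually built.

```python
def non_redundant(records):
    """Keep the first record for each distinct sequence string.

    records: iterable of (identifier, sequence) pairs.
    Only exact duplicates are removed; a fragment that differs by even
    one residue from another entry is retained.
    """
    seen = set()
    kept = []
    for identifier, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            kept.append((identifier, sequence))
    return kept


# "b" duplicates "a" exactly; "c" is a longer form of the same protein.
entries = [("a", "MKT"), ("b", "MKT"), ("c", "MKTL")]
print(non_redundant(entries))
```

The longer form "c" survives the filter, which is why domain counts taken from such a database can still overestimate the number of genes.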

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST.

Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value


This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
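The counting step of this procedure can be sketched as follows. This is only an illustration: the retrieval of neighbours from Medline is assumed to have happened already, the input structure and function name are hypothetical, and "more than two original papers" is read as three or more.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbouring papers reached from at least `min_sources` originals.

    neighbours_by_paper maps each hand-selected paper ID to the IDs of its
    neighbouring papers (up to 100 each in the described procedure).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # set() so that one original paper counts a neighbour only once
        counts.update(set(neighbours))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)
```

With three original papers whose neighbour lists are [a, b], [a, b, c] and [a, c], only paper a is shared by more than two originals and would be kept.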

Seed Alignment

An alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree and linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange;

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animals, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in the interactive Tree Of Life (iTOL). The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathway information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any cluster) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
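The bookkeeping in this procedure can be sketched in a few lines. The example below is an illustration only: it assumes the proteins of one species are already in a dictionary and that the clusters have been computed elsewhere (CD-HIT itself is not run here), and the function names and the tie-breaking rule are hypothetical.

```python
def collapse_identical(proteins):
    """Keep one ID per distinct sequence.

    proteins maps protein ID -> sequence. Returns (kept, aliases): kept maps
    each retained ID to its sequence, aliases maps a retained ID to the other
    IDs that share exactly the same sequence.
    """
    kept, aliases, first_id = {}, {}, {}
    for pid in sorted(proteins):
        seq = proteins[pid]
        if seq in first_id:
            aliases.setdefault(first_id[seq], []).append(pid)
        else:
            first_id[seq] = pid
            kept[pid] = seq
    return kept, aliases


def pick_representatives(proteins, clusters):
    """Return the IDs used in queries for one species.

    clusters is a list of lists of IDs, assumed to come from an external
    clustering run. The longest member represents each cluster (ties broken
    by ID, an assumption); unclustered proteins are kept as they are.
    """
    clustered = {pid for cluster in clusters for pid in cluster}
    reps = [max(c, key=lambda pid: (len(proteins[pid]), pid)) for c in clusters]
    singles = [pid for pid in proteins if pid not in clustered]
    return sorted(reps + singles)
```

Keeping the alias table alongside the collapsed set mirrors the note that different IDs remain available even though only one copy of each sequence is stored.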

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
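The last-common-ancestor step can be illustrated with a short sketch. It assumes each organism is given as a root-to-leaf list of taxon names; the lineages shown in the test are simplified examples, not data taken from SMART or NCBI, and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages is a list of root-to-leaf lists of taxon names, one per
    organism that contains a protein with the architecture of interest.
    """
    if not lineages:
        return None
    ancestor = None
    # Walk down from the root while every lineage agrees on the node.
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break
        ancestor = nodes[0]
    return ancestor
```

If an architecture is found in a chordate, an arthropod and a fungus, the walk stops at the node where the lineages first diverge, so the architecture would be dated to the eukaryotic level.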

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you details of which amino acids are missing and links to the relevant literature (Pubmed link).

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure-based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard 'Domain selection' querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and to use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ('tax break') is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
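The behaviour described here (AND-combined domain names, case-insensitive, with repeated names meaning multiple copies) can be mimicked with a small sketch. This is not SMART's query parser: the function names are hypothetical, only the AND operator is handled, and each protein is assumed to be available as a plain list of domain and feature names.

```python
import re
from collections import Counter


def parse_query(query):
    """Split an AND-only query into a Counter of lower-cased names."""
    terms = re.split(r"\s+AND\s+", query.strip())
    return Counter(t.strip().lower() for t in terms if t.strip())


def matches(query, protein_domains):
    """True if the protein has at least the requested number of each domain."""
    required = parse_query(query)
    present = Counter(d.lower() for d in protein_domains)
    return all(present[name] >= n for name, n in required.items())
```

Counting the terms rather than collecting them into a set is what makes a query such as Sh3 AND sH3 AND sh3 require three copies of the domain instead of one.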

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains.

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
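A tolerance check of this kind might look like the sketch below. How the benchmark actually paired predicted and manual boundaries is not described here, so the pairing by sorted position, the requirement of equal boundary counts and the function name are all assumptions made for illustration.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding reference boundary.

    Both arguments are lists of residue positions; boundaries are paired
    in sorted order, and differing counts are treated as disagreement.
    """
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)
```

For example, predicted boundaries at residues 102 and 250 would agree with manual boundaries at 110 and 240, whereas a single boundary at 102 against a manual one at 130 would not.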

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
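The decision between automatic and manual chopping can be pictured as a simple threshold check. The sketch below encodes only the four criteria that are listed explicitly; the text says "etc.", so further criteria exist that are not modelled, and the class and function names are hypothetical rather than part of the CATH software.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Summary of boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # angstroms
    longest_end_extension: int    # residues


def can_autochop(result):
    """Check the four listed criteria for automatic chopping."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def route(result):
    """Return the stage that handles this chain."""
    return "AutoChop" if can_autochop(result) else "manual (DomChop)"
```

A result that fails any one of the checks is not discarded; as described above, it is passed on as supporting information for manual boundary assignment.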

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing is similar for both parts of the classification protocol, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability are presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.


For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
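The acceptance rule for sequence relatives described above can be sketched in a few lines of Python. This is a minimal illustration and not the CATH pipeline itself: the `HmmHit` record, its field names and the example hits are hypothetical, and the thresholds assume the text is read as an E-value below 0.01 and at least 60% residue overlap with the CATH domain.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """Hypothetical record for one HMM match against a CATH domain."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the match
    domain_length: int     # total number of residues in the CATH domain


def is_homologous(hit, max_e_value=0.01, min_overlap=0.60):
    """Accept a hit only if it passes both the E-value and overlap cut-offs."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


# Made-up hits: only the first one passes both criteria.
hits = [
    HmmHit("seqA", 1e-5, 90, 100),  # significant and well covered
    HmmHit("seqB", 0.5, 95, 100),   # well covered but not significant
    HmmHit("seqC", 1e-8, 40, 100),  # significant but only 40% overlap
]
accepted = [hit.sequence_id for hit in hits if is_homologous(hit)]
print(accepted)
```

Keeping both cut-offs as keyword arguments makes it easy to try the stricter annotation setting mentioned in the same paragraph (95% identity with 80% overlap would need an additional identity field).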

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily is measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
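The SSAP score bands quoted above can be turned into a small lookup, shown here as a hedged Python sketch. The function name and the wording of the returned labels are assumptions made for illustration; the text itself only gives the approximate thresholds (100, above 80, above 70) and stresses that a score alone does not prove common ancestry.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "probably homologous"
    if score > 70:
        # Same fold (T-level); may reflect convergent evolution rather than homology.
        return "similar fold"
    return "no clear structural relationship"


for example_score in (100, 85, 75, 60):
    print(example_score, interpret_ssap(example_score))
```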

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
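The breakdown of the newly classified domains can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text (2159 new domains, 443 with 'new' sequences, and the 199/169/40 split); the variable names are illustrative, and the novel-fold count is derived here as the remainder rather than quoted from the source.

```python
new_domains = 2159    # domains classified June 1997 to March 1998
new_sequences = 443   # domains with no clear sequence homologue of known structure

# Domains that were clearly homologous by sequence to a known structure.
homologous_by_sequence = new_domains - new_sequences

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Whatever is left over must be the novel folds.
counts["novel folds"] = new_sequences - sum(counts.values())


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


percentages = {name: percent(n, new_sequences) for name, n in counts.items()}
structure_based_homologues = percent(199 + 169, new_sequences)

print(percent(homologous_by_sequence, new_domains))  # share homologous by sequence
print(counts["novel folds"], percentages)
print(structure_based_homologues)
```

The clear and probable homologues together give the share of 'new' sequences that can be assigned as homologues once a structure is available.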


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
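Comparing the "first three EC identifiers" amounts to comparing the first three dot-separated fields of each EC number. The following Python sketch illustrates that test; the helper names are hypothetical and the EC numbers are example inputs chosen for illustration, not data taken from the CATH analysis.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec_prefix(ec_numbers, levels=3):
    """True if every enzyme in the family agrees on the first `levels` EC fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative inputs: two serine endopeptidases, and a deliberately mixed pair.
related_family = ["3.4.21.62", "3.4.21.64"]
mixed_family = ["3.4.21.62", "1.1.1.1"]

print(family_shares_ec_prefix(related_family))            # same first three fields
print(family_shares_ec_prefix(related_family, levels=4))  # full EC codes differ
print(family_shares_ec_prefix(mixed_family))
```

Setting `levels=4` corresponds to the stricter "single EC identifier" criterion mentioned in the same paragraph.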

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
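As a worked illustration of the extrapolation in the summary, the two ranges can be combined to estimate what share of a genome might be linked to a known family once structures are available. This combined figure is not stated in the source; it is a sketch of the arithmetic only, the function name is hypothetical, and the inputs assume the ranges read as 54-70% unmatched sequences and 80-90% structure-based homologues.

```python
def structural_coverage(unmatched_pct, homologous_pct):
    """Estimated percentage of genome sequences linked to a known family.

    unmatched_pct: sequences with no obvious sequence match in the PDB.
    homologous_pct: share of those expected to prove homologous to a known
    family once their structures are determined.
    """
    matched_by_sequence = 100 - unmatched_pct
    matched_by_structure = unmatched_pct * homologous_pct / 100
    return round(matched_by_sequence + matched_by_structure, 1)


# Ends of the quoted ranges: most unmatched with fewest homologues, and the reverse.
pessimistic = structural_coverage(unmatched_pct=70, homologous_pct=80)
optimistic = structural_coverage(unmatched_pct=54, homologous_pct=90)
print(pessimistic, optimistic)
```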

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.

To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
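The interface definition above can be sketched as a simple distance test. This is a minimal Python illustration, not JAIL's implementation: the data layout (a dictionary mapping a residue label to a list of atom coordinates), the function names and the toy coordinates are assumptions, the cut-off is read as 4.5 Å, and the "at least 5 C-alpha atoms" rule is interpreted as at least five residues on one side.

```python
import math

CUTOFF_ANGSTROM = 4.5      # assumed reading of the distance cut-off in the text
MIN_RESIDUES_PER_SIDE = 5  # one C-alpha atom per residue


def interface_residues(side, partner, cutoff=CUTOFF_ANGSTROM):
    """Residues of `side` with at least one atom within `cutoff` of any partner atom.

    Both arguments map a residue label to a list of (x, y, z) coordinates.
    """
    partner_atoms = [atom for atoms in partner.values() for atom in atoms]
    return [
        residue_id
        for residue_id, atoms in side.items()
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms)
    ]


def is_valid_side(residues, minimum=MIN_RESIDUES_PER_SIDE):
    """One part of a protein interface needs at least `minimum` residues."""
    return len(residues) >= minimum


# Toy coordinates: residue A1 lies close to chain B, residue A2 does not.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(4.0, 0.0, 0.0)]}

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(side_a, side_b, is_valid_side(side_a))
```

The all-against-all loop is quadratic in the number of atoms; for whole PDB entries a spatial index would be the usual choice, but the brute-force form keeps the definition easy to read.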

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

JAIL can be searched by:

PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)

The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit. A full-text keyword search and a SCOP text search are also available, and results can be filtered by interface location (intra, inter or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
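The idea of detecting over- or under-represented superfamilies "by comparison with other genomes" can be illustrated with a small sketch. The notes do not say which statistic SUPERFAMILY itself uses, so the fold-change rule below, the function names and the example counts are all assumptions made purely for illustration.

```python
from statistics import mean

def representation_ratios(target_counts, other_genomes):
    """Return {superfamily: count in target / mean count in the other genomes}."""
    ratios = {}
    for superfamily, count in target_counts.items():
        # A superfamily absent from a comparison genome counts as 0 there.
        background = mean(g.get(superfamily, 0) for g in other_genomes)
        ratios[superfamily] = count / background if background > 0 else float("inf")
    return ratios

def flag_unusual(ratios, fold=2.0):
    """Split superfamilies into over- and under-represented lists.

    The 2-fold threshold is an arbitrary illustrative choice.
    """
    over = sorted(sf for sf, r in ratios.items() if r >= fold)
    under = sorted(sf for sf, r in ratios.items() if r <= 1.0 / fold)
    return over, under

# Hypothetical domain counts per superfamily.
target = {"P-loop": 40, "Globin": 1, "Kinase": 10}
others = [
    {"P-loop": 10, "Globin": 4, "Kinase": 10},
    {"P-loop": 10, "Globin": 4, "Kinase": 12},
]
print(flag_unusual(representation_ratios(target, others)))
```

A real analysis would normalise for genome size and use a proper statistical test; the ratio here only conveys the comparison being made.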


DBTSS Database of transcriptional start sites

DOGS Database of genome sizes

EID The exon-intron database - Harvard

Exon-Intron Exon-Intron database - Singapore

EPD Eukaryotic promoter database

FlyTrap HTML based gene expression database

GDB The genome database

GenLink Resources for human genetic and telomere research

GeneKnockouts Gene knockout information

GENOTK Human cDNA database

GEO Gene expression omnibus NCBI

GOLD Information on genome projects around the world

GSDB The Genome Sequence DataBase

HGI TIGR human gene index

HTGS High-throughput genomic sequence at NCBI

IMAGE The largest collection of DNA sequences/clones

IMGT The international ImMunoGeneTics information system

IPCN Index to Plant Chromosome Numbers database

LocusLink Single query interface to sequence and genetic loci

TelDB The telomere database

MitoDat Mitochondrial nuclear genes

Mouse EST NIA mouse cDNA project

MPSS Searchable databases of several species

NDB Nucleic acid database

NEDO Human cDNA sequence database

NPD Nuclear protein database

Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute

PLACE Database of plant cis-acting regulatory DNA elements

RDP Ribosomal database project

RDB Receptor database at NIHS Japan

Refseq The NCBI reference sequence project

RHdb Radiation hybrid physical map of chromosomes

SHIGAN SHared Information of GENetic resources Japan

SpliceDB Canonical and non-canonical splice site sequences

STACK Consensus human EST database

TAED The adaptive evolution database

TIGR Curated databases of microbes plants and humans

TRANSFAC The Transcription Factor Database

TRRD Transcription Regulatory region database

UniGene Cluster of sequences for unique genes at NCBI

UniSTS Non-redundant collection of STS

Protein Databases

Protein Sequence Databases - Protein Structure Databases - Protein Domains, Motifs and Signatures - Others

Protein Sequence Databases

Antibodies Sequence and Structure

BRENDA Enzyme database

CD Antigens Database of CD antigens

dbCFC Cytokine family database

Histones Histone sequence database

HPRD Human protein reference database

InterPro Integrated documentation resources for protein families

iProClass An integrated protein classification database

KIND A non-redundant protein sequence database

MHCPEP Database of MHC binding peptides

MIPS Munich information centre for protein sequences

PIR Annotated and non-redundant protein sequence database

PIR-ALN Curated database of protein sequence alignments

PIR-NREF PIR non-redundant reference protein database

PMD Protein mutant database

PRF Protein research foundation Japan

ProClass Non-redundant protein database

ProtoMap Hierarchical classification of Swiss-Prot proteins

REBASE Restriction enzyme database

RefSeq Reference sequence database at NCBI

SwissProt Curated protein sequence database

SPTR Comprehensive protein sequence database

Transfac Transcription factor database

TrEMBL Annotated translations of EMBL nucleotide sequences

Tumor gene database Genes with cancer-causing mutations

WD repeats WD-repeat family of proteins

Protein Structure Databases

Cath Protein structure classification

HIV Protease HIV protease database 3D structure

PDB 3-D macromolecular structure data

PSI Protein structure initiative

S2F Structure to function project

Scop Structural Classification of Proteins

Protein Domains, Motifs & Signatures

BLOCKS Multiple aligned segments of conserved protein regions

CDD Conserved domain database and search service

DOMO Homologous protein domain families

Pfam Database of protein domains and HMMs


ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia

Model Organism Databases and Resources

Arabidopsis - Bacteria - Sea Bass - Cat - Cattle - Chicken - Cotton - Cyano Bacteria - Daphnia - Deer - Dictyostelium - Dog - Frog - Fruit Fly - Fungus - Goat - Horse - Medaka Fish - Maize - Malaria - Mosquito - Mouse - Pig - Plants - Protozoa - Puffer Fish - Rat - Rice - Rickettsia - Salmon - Sheep - Soy - Sorghum - Tetraodon - Tilapia - Turkey - Viruses - Worm - Yeast - Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the www

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC


ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germ plasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated Mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)


Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyano Bacteria (Blue green algae)

Cyano Bacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)


Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics


Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey


Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The Genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain often also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
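To make the idea of scanning a query with a position-specific scoring matrix more concrete, here is a minimal sketch. It is not RPS-BLAST: it assumes a tiny ungapped alignment, a uniform background frequency and a simple pseudocount, and the function names and example sequences are hypothetical. It shows only how column-wise residue counts become log-odds scores, and how a query is scored window by window.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds (base 2) matrix from equal-length, ungapped sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def best_hit(pssm, query):
    """Slide the matrix along the query; return (best_score, start) or None.

    The query must contain only the 20 standard residues.
    """
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        score = sum(pssm[i][query[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best

# Hypothetical four-column motif and query.
motif = build_pssm(["GKST", "GKSS", "GKTT"])
print(best_hit(motif, "AAGKSTAA"))
```

Real PSSMs are derived from large, weighted alignments with realistic background frequencies and gap handling; the sketch keeps only the scoring principle.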


CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu, and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.
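The notion of a domain architecture used by CDART, the sequential order of conserved domains in a protein, can be sketched as an ordered tuple of domain names. The grouping below follows that description; the similarity score (number of shared domains) is only an illustrative assumption, since the notes do not describe how CDART actually scores architectures. All protein IDs and function names are hypothetical.

```python
from collections import defaultdict

def group_by_architecture(proteins):
    """Map each architecture (ordered tuple of domain names) to its protein IDs."""
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)

def shared_domains(query_arch, other_arch):
    """Toy similarity: number of distinct domains present in both architectures."""
    return len(set(query_arch) & set(other_arch))

def rank_architectures(query_arch, proteins):
    """Group proteins by architecture and rank the groups against the query."""
    scored = [
        (shared_domains(query_arch, arch), arch, sorted(ids))
        for arch, ids in group_by_architecture(proteins).items()
    ]
    scored = [entry for entry in scored if entry[0] > 0]
    # Highest score first; ties are broken by architecture for stable output.
    return sorted(scored, key=lambda entry: (-entry[0], entry[1]))

# Hypothetical proteins, each with its domains in N- to C-terminal order.
proteins = {
    "P1": ["SH3", "SH2", "Kinase"],
    "P2": ["SH3", "SH2", "Kinase"],
    "P3": ["SH2"],
    "P4": ["PDZ"],
}
for score, arch, ids in rank_architectures(("SH3", "SH2", "Kinase"), proteins):
    print(score, arch, ids)
```

Treating the architecture as a tuple keeps domain order significant, so "SH3, SH2" and "SH2, SH3" form different groups even though the toy score treats them alike.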

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (Mar 2013, 14831 families)

Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

- Family: A collection of related protein regions
- Domain: A structural unit
- Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
- Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.


You can find data in Pfam in various ways:

- Analyze your protein sequence for Pfam matches
- View Pfam family annotation and alignments
- See groups of related families
- Look at the domain organisation of a protein sequence
- Find the domains on a PDB structure
- Query Pfam by keywords
- Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

The Pfam website lets you browse lists of families, clans or proteomes by their initial letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability


HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment.
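The relationship between the gathering threshold (GA), trusted cutoff (TC) and noise cutoff (NC) defined in this glossary can be illustrated with a short sketch. It is a simplification: it applies a single sequence-level GA (Pfam also keeps a separate domain cutoff), and the function name, sequence IDs and scores are hypothetical.

```python
def summarize_family(sequence_scores, gathering_threshold):
    """Split hits by the gathering threshold and report TC and NC.

    sequence_scores: dict mapping a sequence ID to its bit score.
    Returns (member_ids, trusted_cutoff, noise_cutoff); a cutoff is None
    when the corresponding group is empty.
    """
    # A sequence belongs to the full alignment if it attains at least the GA.
    members = {sid: s for sid, s in sequence_scores.items() if s >= gathering_threshold}
    rejected = {sid: s for sid, s in sequence_scores.items() if s < gathering_threshold}
    # TC: lowest-scoring match inside the full alignment.
    trusted_cutoff = min(members.values()) if members else None
    # NC: highest-scoring match left outside the full alignment.
    noise_cutoff = max(rejected.values()) if rejected else None
    return sorted(members), trusted_cutoff, noise_cutoff

# Hypothetical bit scores for four sequences searched with one HMM.
scores = {"seqA": 50.0, "seqB": 27.5, "seqC": 24.9, "seqD": 10.0}
print(summarize_family(scores, gathering_threshold=25.0))
```

By construction the GA always lies between the two values: NC is below it and TC is at or above it.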

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used (Ensembl for metazoans and Swiss-Prot for the rest).


The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used to easily identify the mode we are in: Normal mode or Genomic mode.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
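The definition above is a log-odds calculation. Here is a minimal sketch of it; the function name and the two example likelihoods are hypothetical, and real tools compute these likelihoods from their own scoring models.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio (homologue model vs random model)."""
    return math.log2(likelihood_homologue / likelihood_random)


# If the homologue model explains the sequence 8 times better than the
# random model, the score is log2(8) = 3 bits.
print(bits_score(8.0, 1.0))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.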

BLAST (Basic Local Alignment Search Tool)

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
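The difference between domain composition and domain organisation can be made concrete with a small sketch. This is an illustration of the two definitions above, not SMART's implementation; the function names are hypothetical, and a protein is modelled simply as a list of domain names in N- to C-terminal order.

```python
def same_composition(query_domains, target_domains):
    """True if the target has at least one copy of each domain in the query."""
    return set(query_domains) <= set(target_domains)


def same_organisation(query_domains, target_domains):
    """True if the query's domains occur in the target in the same order.

    Additional domains in the target are allowed, so this is an ordered
    subsequence check: each query domain must be found after the previous one.
    """
    remaining = iter(target_domains)
    return all(domain in remaining for domain in query_domains)


query = ["SH3", "SH2", "TyrKc"]
print(same_organisation(query, ["SH3", "PH", "SH2", "TyrKc"]))  # extra PH allowed
print(same_composition(query, ["TyrKc", "SH2", "SH3"]))         # same set of domains
print(same_organisation(query, ["TyrKc", "SH2", "SH3"]))        # order differs
```

Organisation is the stricter test: every protein that passes it also passes the composition test, but not the other way round.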

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM (Hidden Markov Model)

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
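The consensus rule above can be sketched in a few lines. This is an illustrative sketch only: the input format (one dictionary of residue probabilities per position) and the function name are assumptions, and showing less conserved residues in lower case is assumed here as the natural counterpart of the capital-letter rule.

```python
def hmm_consensus(position_probabilities, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    The most probable residue is shown at each position; it is capitalised
    only when its probability exceeds the threshold (0.5 for protein models).
    """
    consensus = []
    for probabilities in position_probabilities:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        if probability > threshold:
            consensus.append(residue.upper())
        else:
            consensus.append(residue.lower())
    return "".join(consensus)


# Hypothetical emission probabilities for a three-position model.
model = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.9, "R": 0.1},
]
print(hmm_consensus(model))
```

In this toy model only the first and third positions pass the 0.5 threshold, so only those two letters are capitalised.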

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB (non-redundant database)

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
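The E-value and the P-value describe the same chance event in two ways: an expected count and a probability. The sketch below converts between them under the assumption, standard in BLAST statistics but not stated in these notes, that the number of chance hits follows a Poisson distribution, so that P = 1 - exp(-E). The function names are hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)


def e_value_from_p_value(p_value):
    """Inverse conversion: E = -ln(1 - P)."""
    return -math.log1p(-p_value)


for e in (1e-5, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two numbers are nearly identical, which is why they are often used interchangeably for strong hits; they diverge as E approaches 1, because a probability cannot exceed 1.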

PDB (Protein Data Bank)

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database (domain sequence database)

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
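The selection rule described above is a counting procedure, sketched below. The retrieval of neighbouring papers from Medline is not shown; the sketch assumes that step has already produced a mapping from each hand-selected paper to its list of neighbours. All identifiers and the function name are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, max_neighbours=100, min_sources=3):
    """Select neighbouring papers linked from more than two original papers.

    neighbours_by_paper maps each hand-selected paper to its ranked list of
    neighbouring papers. Only the first max_neighbours are considered, and
    min_sources=3 encodes "more than two original papers".
    """
    source_counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # A set ensures each original paper counts a neighbour only once.
        source_counts.update(set(neighbours[:max_neighbours]))
    return sorted(paper for paper, n in source_counts.items() if n >= min_sources)


# Hypothetical paper identifiers, for illustration only.
neighbours = {
    "orig1": ["n1", "n2", "n3"],
    "orig2": ["n1", "n3"],
    "orig3": ["n1", "n4"],
}
print(secondary_literature(neighbours))
```

Only the paper reached from three different original papers is kept in this toy input; papers reached from one or two are dropped.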

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported, hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
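The last two steps of the procedure above (choosing representatives and keeping unclustered proteins) can be sketched as follows. The clustering itself is done by CD-HIT and is not reproduced here; the sketch assumes its output is already available as lists of (identifier, sequence) pairs for one species. The function name and the example data are hypothetical.

```python
def select_representatives(clusters, single_proteins):
    """Return the IDs used for architecture queries and domain counts.

    clusters: list of clusters, each a list of (protein_id, sequence) pairs.
    single_proteins: (protein_id, sequence) pairs that belong to no cluster.
    The longest member of each cluster is taken as its representative.
    """
    representatives = [
        max(cluster, key=lambda member: len(member[1]))[0] for cluster in clusters
    ]
    singles = [protein_id for protein_id, _sequence in single_proteins]
    return representatives + singles


# Hypothetical identifiers and toy sequences for one species.
clusters = [
    [("P1_fragment", "MKTAY"), ("P1_full", "MKTAYIAKQR")],
    [("P2a", "MSSHEG"), ("P2b", "MSSH")],
]
singles = [("P3", "MVLSPADKT")]
print(select_representatives(clusters, singles))
```

Taking the longest member favours full-length proteins over fragments of the same gene, which is the redundancy the procedure is meant to reduce.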

Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
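The last-common-ancestor step described above can be sketched on taxonomic lineages. This is an illustration only: it assumes each organism is given as a root-to-leaf list of taxon names (a simplification of the NCBI taxonomy), and the function name and the example lineages are hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    Walks the lineages in parallel from the root and stops at the first rank
    where they disagree. Returns None if no lineages are given or if they
    share no common root.
    """
    ancestor = None
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        ancestor = taxa_at_rank[0]
    return ancestor


# Simplified lineages of organisms that contain a given domain architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))
```

Adding a single organism from a distant branch moves the inferred point of origin towards the root, so the result is sensitive to outliers such as contamination or horizontal transfer.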

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details visit the change mode page

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
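The notion of a 1:1 reciprocal best match can be sketched as follows. The sketch assumes the best hit of every protein in the other genome has already been computed (for example from pairwise similarity searches) and is supplied as two dictionaries; the function name and the identifiers are hypothetical.

```python
def reciprocal_best_matches(best_hit_a_to_b, best_hit_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b)
        for a, b in best_hit_a_to_b.items()
        if best_hit_b_to_a.get(b) == a
    )


# Hypothetical best hits between two genomes.
human_to_mouse = {"h1": "m1", "h2": "m2", "h3": "m2"}
mouse_to_human = {"m1": "h1", "m2": "h3"}
print(reciprocal_best_matches(human_to_mouse, mouse_to_human))
```

Here h2 is dropped because its best hit, m2, points back to h3 rather than to h2; only mutually best pairs are treated as 1:1 orthologs.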

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for Mozilla web browser. Click here for more info.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
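The query behaviour described above (AND-combined names, case insensitivity, repeated names asking for multiple copies) can be illustrated with a small matcher. This is a sketch under stated assumptions, not SMART's query engine: it assumes that repeating a name N times requires at least N copies, and the function name and feature lists are hypothetical.

```python
import re
from collections import Counter


def matches_query(query, protein_features):
    """Check a protein's domains/features against an AND-combined query.

    Names are compared case-insensitively, and a name repeated N times in the
    query is taken to require at least N copies in the protein.
    """
    terms = re.split(r"\s+AND\s+", query.strip())
    required = Counter(term.lower() for term in terms)
    present = Counter(feature.lower() for feature in protein_features)
    return all(present[name] >= count for name, count in required.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3"]))
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"]))
```

Counting occurrences rather than testing set membership is what lets a repeated name express "at least three SH3 domains".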

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram, representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only ~2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains.

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more of side chains resolved.
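The nine-level scheme and the inclusion criteria described above can be illustrated with a small Python sketch. The dotted string layout of the identifier, the example identifier itself and all function names are assumptions made for illustration; they are not taken from the CATH software.

```python
# Level letters follow the text: C, A, T, H plus the five SOLID levels.
CATHSOLID_LEVELS = ["C", "A", "T", "H", "S", "O", "L", "I", "D"]


def parse_cathsolid(identifier):
    """Split a dotted nine-part code into a {level: number} dict.

    Assumes (hypothetically) that the code is written as nine
    dot-separated integers, one per level.
    """
    parts = identifier.strip().split(".")
    if len(parts) != len(CATHSOLID_LEVELS):
        raise ValueError("expected 9 dot-separated levels, got %d" % len(parts))
    return dict(zip(CATHSOLID_LEVELS, (int(p) for p in parts)))


def superfamily_code(identifier):
    """Return the first four levels (C.A.T.H), i.e. the homologous superfamily."""
    return ".".join(identifier.strip().split(".")[:4])


def meets_inclusion_criteria(resolution_angstrom, n_residues, fraction_side_chains):
    """Inclusion thresholds quoted in the text: resolution of 4 A or better,
    at least 40 residues, and at least 70% of side chains resolved."""
    return (resolution_angstrom <= 4.0
            and n_residues >= 40
            and fraction_side_chains >= 0.70)


# Hypothetical identifier, used only to show the shape of the output.
example = parse_cathsolid("3.40.50.200.1.1.1.1.1")
print(example["H"], superfamily_code("3.40.50.200.1.1.1.1.1"))
```

Keeping the level letters in a single list makes it easy to see that the D-level is simply the last counter appended to the hierarchy.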


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
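The ±15 residue comparison used in the benchmark can be sketched in Python as follows. The text does not say exactly how the tolerance was applied, so this sketch assumes that each domain is a single (start, end) segment and that both ends must fall within the tolerance; the function name and data layout are hypothetical.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if every predicted domain boundary lies within
    `tolerance` residues of the manually assigned one.

    `predicted` and `reference` are lists of (start, end) residue
    numbers, one tuple per domain, given in the same order.
    """
    if len(predicted) != len(reference):
        return False  # a different number of domains cannot agree
    for (p_start, p_end), (r_start, r_end) in zip(predicted, reference):
        if abs(p_start - r_start) > tolerance or abs(p_end - r_end) > tolerance:
            return False
    return True


# Hypothetical two-domain chain: boundaries differ by at most 10 residues.
print(boundaries_agree([(1, 100), (101, 250)], [(5, 110), (111, 250)]))
```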

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
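The AutoChop decision described above amounts to a conjunction of thresholds. The Python sketch below encodes only the four criteria quoted in the text; the "etc." in the original indicates further criteria that are not listed and are therefore not modelled here. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float             # SSAP structural alignment score (0-100)
    sequence_identity: float      # percent identity to the query chain
    rmsd: float                   # RMSD in angstroms
    longest_end_extension: int    # residues


def passes_autochop(result):
    """Apply the four thresholds quoted in the text (others are omitted)."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)


def choose_route(results):
    """Return 'AutoChop' if any inherited chopping passes, else 'manual'."""
    return "AutoChop" if any(passes_autochop(r) for r in results) else "manual"


# Hypothetical scores for two candidate parent chains.
candidates = [InheritedChopping(78.0, 92.0, 2.1, 4),
              InheritedChopping(91.5, 88.0, 1.4, 3)]
print(choose_route(candidates))
```

When no candidate passes, the best result would still be shown to the curator, which is why the fallback is labelled 'manual' rather than treated as a failure.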


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability are presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.


For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information on and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
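The two filtering steps in this paragraph can be sketched in Python. The thresholds are read from the text as printed here (E-value below 0.01 with 60% overlap for homologue detection; 95% identity with 80% overlap for annotation transfer); treating the identity and overlap thresholds as "at least" is an assumption, and the function names are hypothetical.

```python
def is_homologous_hit(e_value, overlap_fraction, max_e_value=0.01, min_overlap=0.60):
    """HMM scan filter: significant E-value and sufficient overlap
    with the CATH domain."""
    return e_value < max_e_value and overlap_fraction >= min_overlap


def usable_for_annotation(percent_identity, overlap_fraction):
    """BLAST filter for inheriting annotation from ENZYME, GO, KEGG, etc.
    Assumes 'at least 95% identity and at least 80% overlap'."""
    return percent_identity >= 95.0 and overlap_fraction >= 0.80


# Hypothetical hits: (E-value, fraction of the domain covered).
hits = [(1e-12, 0.92), (0.5, 0.95), (1e-5, 0.40)]
kept = [h for h in hits if is_homologous_hit(*h)]
print(len(kept))
```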

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available (for example, CATH domain PDB files, sequences and DSSP files); they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
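As a rough illustration, the SSAP score ranges quoted above can be turned into a lookup in Python. The cut-offs are the ones given in the text, which describes them as general tendencies rather than strict rules; the function name and the returned labels are hypothetical.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "probable homologue"
    if score > 70:
        return "similar fold (T-level)"
    return "no clear structural relationship"


for s in (100, 86.3, 74.0, 55.2):
    print(s, interpret_ssap(s))
```

A score in the 70 to 80 band without any sequence or functional similarity should be read with the caution the text gives: it may reflect convergence rather than common ancestry.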

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995 information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein–ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α–β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are how many more folds we need to determine before we have the complete set, and how confident we can be in assigning function between proteins having similar structures.

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found that only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
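The four categories in this breakdown, and the percentages quoted for the 443 structures, can be reproduced with a short Python sketch. The decision order and the single boolean standing in for "functional similarity and/or significant profile score" are simplifying assumptions; the function and variable names are hypothetical.

```python
def classify_relationship(same_fold, ssap, percent_identity, other_evidence):
    """Assign a new structure to one of the four categories in the text.

    `other_evidence` stands for functional similarity and/or a significant
    score from a sensitive sequence search (e.g. PSI-BLAST).
    """
    if not same_fold:
        return "novel fold"
    if ssap >= 80 and percent_identity >= 20:
        return "clear homologue"
    if percent_identity < 20 and other_evidence:
        return "probable homologue"
    return "analogue"


# Counts quoted in the text for the 443 structures with 'new' sequences.
total = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
counts["novel fold"] = total - sum(counts.values())  # remainder, not quoted directly

percentages = {name: round(100 * n / total) for name, n in counts.items()}
print(counts["novel fold"], percentages)
```

The remainder works out to 35 structures, which is consistent with the 8% of novel folds stated in the text, and the clear plus probable homologues together give the 83% cited later in the section.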


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
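The comparison of EC identifiers at a chosen depth can be sketched in Python. EC numbers have four dot-separated fields, and "the first three EC identifiers" is taken here to mean the first three fields; the function names are hypothetical and the EC numbers in the example are used only for illustration.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number as a tuple."""
    return tuple(ec_number.strip().split(".")[:levels])


def family_shares_ec(ec_numbers, levels=3):
    """True if every EC number in the family agrees on the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative family: two serine endopeptidase EC numbers.
family = ["3.4.21.62", "3.4.21.66"]
print(family_shares_ec(family, levels=3), family_shares_ec(family, levels=4))
```

Comparing at three fields groups enzymes with closely related reactions, while comparing at four fields corresponds to the stricter "single EC code" statistic.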

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.

For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways.

i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70

of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5157

will almost certainly reveal some clues to the function For the superfolds some folds will reveal

information on the functional class (eg enzyme for TIM barrels) or the location of the active site if

not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab

initio methods referred to above may provide some clues to guide experiments Therefore it is clear

that determining structures as part of a structural genomicslsquo initiative for example will make a

major contribution to interpreting genome data

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.

[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive, non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, each part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
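The distance rule for protein chains can be sketched in a few lines of Python. This is only an illustration of the definition as stated, not JAIL's own code: the data layout (a dict from residue id to a list of atom coordinates) and the function names are assumptions made for the example, and the nucleic-acid case is not covered.

```python
from math import dist

CUTOFF = 4.5       # distance cutoff in Angstrom, from the definition in the text
MIN_RESIDUES = 5   # minimum C-alpha atoms (one per residue) on each side

def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return ids of residues in chain_a with at least one atom within
    `cutoff` of any atom of chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    (Hypothetical layout chosen for this sketch.)
    """
    atoms_b = [xyz for atoms in chain_b.values() for xyz in atoms]
    return {
        res_id
        for res_id, atoms in chain_a.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in atoms_b)
    }

def find_interface(chain_a, chain_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """Return (side_a, side_b), or None if either side has too few residues."""
    side_a = interface_residues(chain_a, chain_b, cutoff)
    side_b = interface_residues(chain_b, chain_a, cutoff)
    if len(side_a) < min_residues or len(side_b) < min_residues:
        return None
    return side_a, side_b
```

The all-against-all distance check is quadratic in the number of atoms; a real pipeline would normally use a spatial index, which is left out here to keep the rule itself visible.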

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
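To see why a 50% identity ceiling yields a more diverse set than 95%, the greedy incremental clustering idea behind CD-HIT can be sketched as follows. This is a toy illustration, not the CD-HIT program: the identity measure below is an ungapped, position-by-position comparison chosen for brevity, and all function names are hypothetical.

```python
def toy_identity(a, b):
    """Fraction of identical positions over the shorter sequence.

    Ungapped stand-in for a real alignment-based identity (assumption
    made for this sketch).
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering: longest sequences first; each sequence
    joins the first representative it matches at >= threshold, otherwise it
    becomes a new representative."""
    clusters = {}  # representative sequence -> list of members
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if toy_identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters

seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFWWWWW", "MNPQRSTVWY"]
print(len(greedy_cluster(seqs, 0.95)))  # representatives kept at 95%
print(len(greedy_cluster(seqs, 0.50)))  # representatives kept at 50%
```

With the lower threshold, more sequences collapse into existing clusters, so fewer but mutually more dissimilar representatives remain.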

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

A full-text keyword search and a SCOP text search are also available, and results can be restricted to interfaces having a given location (intra, inter, or don't care).

MMDB

Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features:

(i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures.

(ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles.

(iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure.

In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
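Term searches and cross-database links of the kind described above can also be issued programmatically. The sketch below only builds request URLs and makes no network call. It assumes NCBI's E-utilities endpoints (esearch and elink), which the notes themselves do not mention, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (an assumption of this sketch; the notes
# only give the interactive Entrez address).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def structure_search_url(term, retmax=20):
    """URL for a term search against the Entrez structure database."""
    query = urlencode({"db": "structure", "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"

def protein_to_structure_url(protein_id):
    """URL asking for structure records linked to a protein record."""
    query = urlencode({"dbfrom": "protein", "db": "structure", "id": protein_id})
    return f"{EUTILS}/elink.fcgi?{query}"

print(structure_search_url("capsid protein"))
print(protein_to_structure_url("P03697"))
```

Keeping URL construction separate from fetching makes the query parameters easy to inspect before any request is sent.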


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


Recent news

3rd June 2013 genetrainer being launched at LeWeb conference

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others

Protein Sequence Databases

Antibodies Sequence and Structure

BRENDA Enzyme database

CD Antigens Database of CD antigens

dbCFC Cytokine family database

Histones Histone sequence database

HPRD Human protein reference database

InterPro Integrated documentation resources for protein families

iProClass An integrated protein classification database

KIND A non-redundant protein sequence database

MHCPEP Database of MHC binding peptides

MIPS Munich information centre for protein sequences

PIR Annotated and non-redundant protein sequence database

PIR-ALN Curated database of protein sequence alignments

PIR-NREF PIR non-redundant reference protein database

PMD Protein mutant database

PRF Protein research foundation Japan

ProClass Non-redundant protein database

ProtoMap Hierarchical classification of SwissProt proteins

REBASE Restriction enzyme database

RefSeq Reference sequence database at NCBI

SwissProt Curated protein sequence database

SPTR Comprehensive protein sequence database

Transfac Transcription factor database

TrEMBL Annotated translations of EMBL nucleotide sequences

Tumor gene database Genes with cancer-causing mutations

WD repeats WD-repeat family of proteins

Protein Structure Databases

Cath Protein structure classification

HIV Protease HIV protease database 3D structure

PDB 3-D macromolecular structure data

PSI Protein structure initiative

S2F Structure to function project

Scop Structural Classification of Proteins

Protein Domains, Motifs & Signatures

BLOCKS Multiple aligned segments of conserved protein regions

CDD Conserved domain database and search service

DOMO Homologous protein domain families

Pfam Database of protein domains and HMMs


ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia

Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – CyanoBacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the www

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC


ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germplasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated Mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)


Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyano Bacteria (Blue green algae)

Cyano Bacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)


Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics


Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey


Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain also carries specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
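As a small illustration of detecting a conserved motif in a polypeptide sequence, the sketch below translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The pattern shown is the P-loop (Walker A) motif, [AG]-x(4)-G-K-[ST]; the translator handles only a small subset of the PROSITE syntax, and the function names and the example sequence are made up for this sketch.

```python
import re

def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE pattern syntax to a Python regex.

    Supported: x (any residue), [ABC] (one of), {ABC} (none of),
    and repeat counts such as x(4) or x(2,3).
    """
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")                      # any residue
        element = element.replace("{", "[^").replace("}", "]")   # exclusion
        element = element.replace("(", "{").replace(")", "}")    # repeats
        parts.append(element)
    return "".join(parts)

def find_motifs(sequence, pattern):
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]

P_LOOP = "[AG]-x(4)-G-K-[ST]"
print(prosite_to_regex(P_LOOP))
print(find_motifs("MKLGPPGSGKSTLA", P_LOOP))
```

Short patterns like this one match by chance fairly often, which is one reason domain databases rely on alignment-based models rather than patterns alone.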


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

CDD

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
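The idea of scoring a query against a position-specific scoring matrix can be sketched as an ungapped sliding-window scan. This is not RPS-BLAST, which uses BLAST heuristics, gapped alignment and statistical significance estimates; the toy matrix, the default score for unlisted residues and the function names are all assumptions made for illustration.

```python
def score_window(pssm, window, default=-1):
    """Sum the per-position scores of `window` against the PSSM.

    `pssm` is a list of dicts, one per model position, mapping a residue
    to its score; residues not listed get `default`.
    """
    return sum(col.get(res, default) for col, res in zip(pssm, window))

def best_hit(pssm, seq, default=-1):
    """Return (0-based start, score) of the best-scoring ungapped window,
    or None if the sequence is shorter than the PSSM."""
    width = len(pssm)
    if len(seq) < width:
        return None
    scored = [
        (score_window(pssm, seq[i:i + width], default), i)
        for i in range(len(seq) - width + 1)
    ]
    # Highest score wins; ties go to the leftmost window.
    score, start = max(scored, key=lambda t: (t[0], -t[1]))
    return start, score

# Toy three-column PSSM (made-up scores).
toy_pssm = [{"G": 5, "A": 2}, {"K": 6, "R": 3}, {"S": 4, "T": 4}]
print(best_hit(toy_pssm, "MAGKSLL"))
```

Because each matrix column has its own scores, the same residue can be rewarded at one position and penalised at another, which is what distinguishes a PSSM from a single substitution matrix.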


CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu, and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
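Treating an architecture as the ordered list of a protein's domains, grouping and ranking by architecture can be sketched as below. The scoring used here (number of distinct domains shared with the query) is a stand-in chosen for the example, since the notes do not describe CDART's actual scoring; the protein ids and function names are hypothetical.

```python
from collections import defaultdict

def group_by_architecture(proteins):
    """Group protein ids by architecture.

    `proteins` maps a protein id to its list of domain names in
    N-terminal to C-terminal order.
    """
    groups = defaultdict(list)
    for pid, domains in proteins.items():
        groups[tuple(domains)].append(pid)
    return dict(groups)

def rank_architectures(query, proteins):
    """Rank architectures by the number of distinct domains shared with
    the query (illustrative score only). Architectures sharing no domain
    with the query are dropped."""
    q = set(query)
    ranked = [
        (len(q & set(arch)), arch, sorted(ids))
        for arch, ids in group_by_architecture(proteins).items()
        if q & set(arch)
    ]
    ranked.sort(key=lambda t: (-t[0], t[1]))
    return ranked

proteins = {
    "p1": ["SH3", "SH2", "Kinase"],
    "p2": ["SH3", "SH2", "Kinase"],
    "p3": ["SH2", "Kinase"],
    "p4": ["PDZ"],
}
for score, arch, ids in rank_architectures(["SH3", "SH2", "Kinase"], proteins):
    print(score, "-".join(arch), ids)
```

Comparing domain lists instead of residues is what lets this kind of search relate proteins whose sequences have diverged beyond direct recognition.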

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries. Both services are provided by the United States National Center for Biotechnology Information. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD).

Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (Mar 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries aregrouped togetherinto clans therelationship may bedefined by similarity of

sequence structure orprofile-HMM

Pfam 27.0 (March 2013, 14,831 families)

Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches.

View a Pfam family: View Pfam family annotation and alignments.

View a clan: See groups of related families.

View a sequence: Look at the domain organisation of a protein sequence.

View a structure: Find the domains on a PDB structure.

Keyword search: Query Pfam by keywords.

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website.

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
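The two GA cutoffs can be pictured with a small sketch. It assumes, for illustration only, that a sequence is first checked against the sequence cutoff and that each of its domain hits is then checked against the domain cutoff. The function name, argument names and scores are hypothetical and do not come from Pfam's own code.

```python
def domains_passing_ga(sequence_score, domain_scores, seq_cutoff, dom_cutoff):
    """Return the domain scores (in bits) that pass both GA cutoffs.

    Assumption: the sequence must reach seq_cutoff before any of its
    domains are considered, and each domain must reach dom_cutoff.
    """
    if sequence_score < seq_cutoff:
        return []
    return [score for score in domain_scores if score >= dom_cutoff]


# Hypothetical scores for one sequence with three domain hits.
kept = domains_passing_ga(62.5, [30.1, 24.0, 8.4], seq_cutoff=25.0, dom_cutoff=20.0)
print(kept)
```

In this toy case the third domain hit is dropped because it falls below the domain cutoff, even though the sequence as a whole passes.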

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit score of the highest-scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
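To make the scale concrete, the sketch below converts a posterior-probability annotation string into numeric levels. It assumes the string uses the digits 0-9 plus '*' for the top level (10), and that any other character, such as a gap marker, carries no value. The function name and the example string are hypothetical.

```python
def posterior_levels(pp_line):
    """Convert a posterior-probability string into a list of levels.

    Assumption: '*' encodes the highest level (10), a digit encodes its
    own value, and anything else (e.g. '.') has no level and maps to None.
    """
    levels = []
    for char in pp_line:
        if char == "*":
            levels.append(10)
        elif char.isdigit():
            levels.append(int(char))
        else:
            levels.append(None)
    return levels


# Hypothetical annotation for a six-column stretch of an alignment.
print(posterior_levels("**97.3"))
```

Positions with low levels are the ones a heat map view would colour towards red.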

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest-scoring match in the full alignment.

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that we are then exploring a limited set of genomes.

Different color schemes are used to easily identify the mode we are in (Normal mode or Genomic mode).

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
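The definition above can be written out directly. The sketch below is a minimal illustration of the base-2 log-likelihood ratio; the function name and the two example likelihoods are hypothetical, and real tools compute these likelihoods from their own models.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio, as described in the text."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homologue model
# than under the random model scores about 3 bits.
print(bits_score(8.0, 1.0))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.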

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order. (Additional domains are allowed.)

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
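The rule for building the consensus line can be sketched in a few lines. The example assumes that each model position is given as a dictionary of amino acid probabilities; this data layout, the function name and the probabilities are hypothetical and are not the HMMer file format.

```python
def hmm_consensus(position_probabilities, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    The most probable residue is taken at each position and written in
    upper case only if its probability exceeds the threshold.
    """
    consensus = []
    for probabilities in position_probabilities:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        if probability > threshold:
            consensus.append(residue.upper())
        else:
            consensus.append(residue.lower())
    return "".join(consensus)


# Hypothetical three-position model.
model = [
    {"G": 0.9, "A": 0.1},
    {"L": 0.4, "V": 0.35, "I": 0.25},
    {"K": 0.6, "R": 0.4},
]
print(hmm_consensus(model))
```

Here the second position is written in lower case because no single residue dominates it.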

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
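For BLAST-style statistics, where the number of chance hits is modelled as Poisson-distributed, the E-value and the P-value are related by P = 1 - exp(-E). The sketch below applies that relation in both directions. It illustrates the standard formula only, not SMART's own code, and the function names are hypothetical.

```python
import math


def e_value_to_p_value(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit."""
    return -math.expm1(-e_value)


def p_value_to_e_value(p_value):
    """E = -ln(1 - P): expected number of chance hits (requires P < 1)."""
    return -math.log1p(-p_value)


# For small values the two numbers are nearly identical.
print(e_value_to_p_value(0.001))
print(e_value_to_p_value(5.0))
```

For E-values well below 1 the two measures can be read almost interchangeably, while for large E-values the P-value saturates towards 1.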

PDB: protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
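The counting step of this procedure can be sketched as follows. The example assumes the neighbouring papers have already been retrieved and are supplied as a dictionary keyed by hand-selected paper; the function name, the dictionary layout and the identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_papers=3):
    """Return neighbours linked to more than two hand-selected papers.

    neighbours_by_paper maps each hand-selected paper to the list of
    neighbouring papers retrieved for it (assumed to be done elsewhere).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour at most once per hand-selected paper.
        counts.update(set(neighbours))
    return sorted(paper for paper, count in counts.items() if count >= min_papers)


# Hypothetical identifiers for three hand-selected papers.
retrieved = {
    "paper1": ["n1", "n2", "n3"],
    "paper2": ["n1", "n2"],
    "paper3": ["n1", "n3", "n3"],
}
print(secondary_literature(retrieved))
```

Only the neighbour shared by all three hand-selected papers passes the "more than two" rule in this toy input.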

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange;

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at http://smart.embl.de/smart/das and provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
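Two of the steps in this procedure, collapsing identical proteins and choosing the longest member of each cluster, can be sketched as below. The clustering itself is assumed to have been done by an external tool such as CD-HIT; the function names, the data layout and the toy sequences are hypothetical.

```python
def collapse_identical(proteins):
    """Keep one copy of each distinct sequence, remembering every ID.

    proteins is a list of (protein_id, sequence) pairs; the result maps
    each distinct sequence to the list of IDs that share it.
    """
    ids_by_sequence = {}
    for protein_id, sequence in proteins:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def pick_representatives(clusters):
    """Return the longest sequence of each cluster.

    clusters is a list of lists of sequences; a single protein that is
    not a member of any larger cluster is passed as a one-member list.
    """
    return [max(cluster, key=len) for cluster in clusters]


proteins = [("idA", "MKT"), ("idB", "MKT"), ("idC", "MKTLLV")]
print(collapse_identical(proteins))
print(pick_representatives([["MKT", "MKTLLV"], ["GGS"]]))
```

Keeping the ID list alongside each unique sequence mirrors the note that different IDs remain available after identical copies are merged.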

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
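The last-common-ancestor step can be sketched as follows, assuming each organism is represented by its taxonomic lineage written from the root downwards. The function name and the shortened lineages are hypothetical; a real implementation would read lineages from the NCBI taxonomy.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    Returns None if the lineages share no common prefix (or if no
    lineages are given).
    """
    ancestor = None
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            ancestor = taxa_at_rank[0]
        else:
            break
    return ancestor


# Hypothetical lineages of three organisms sharing one architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))
```

In this toy input the architecture would be dated to the eukaryotic ancestor, because that is the deepest node all three lineages share.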

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing, and the links to relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience; Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display the SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
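The idea of a reciprocal best match, mentioned under Orthology information, can be sketched as follows: protein a in one genome and protein b in another are paired only if b is the best match of a and a is the best match of b. The best-match dictionaries are assumed to have been computed beforehand (for example from similarity searches), and the function name and identifiers are hypothetical.

```python
def reciprocal_best_matches(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs that are each other's best match."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


# Hypothetical best matches between two genomes.
genome1_to_genome2 = {"g1_p1": "g2_p7", "g1_p2": "g2_p9", "g1_p3": "g2_p7"}
genome2_to_genome1 = {"g2_p7": "g1_p1", "g2_p9": "g1_p5"}
print(reciprocal_best_matches(genome1_to_genome2, genome2_to_genome1))
```

Only the first pair survives here, because the other two candidates are not best matches in both directions.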

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes

As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 30 to 31

Startup page The start page now includes selective SMART and allows to search for keywords in the annotationof domains

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
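
As a rough illustration of the matching rule just described, the sketch below treats a query as domain names joined by AND, compares names case-insensitively, and requires one copy of a domain for each time its name is repeated. The function name `matches_query` and the plain list of domain names standing in for a protein are assumptions made for this example; this is not SMART's own code.

```python
from collections import Counter


def matches_query(query: str, protein_domains: list[str]) -> bool:
    """Check an AND-only domain query against one protein's domain list.

    Hypothetical helper, not part of SMART.
    """
    # Drop the AND keyword and lower-case every name so matching ignores case.
    terms = [tok.lower() for tok in query.split() if tok.upper() != "AND"]
    required = Counter(terms)  # a name listed N times needs N copies
    present = Counter(name.lower() for name in protein_domains)
    return all(present[name] >= count for name, count in required.items())
```

With this reading, the query `Sh3 AND sH3 AND sh3` is satisfied only by a protein carrying at least three SH3 domains.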

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows looking for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
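
To make the nine-level code concrete, the sketch below splits a dot-separated CATHsolid code into its named levels and recovers the four-level superfamily prefix. The level names, their order and the identity cut-offs come from the text; the helper names `parse_cathsolid` and `superfamily_of`, and the example code used in the note underneath, are assumptions for illustration only.

```python
# Level letters in hierarchy order: four structural levels, then SOLID.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence-identity cut-offs (%) for the clustered sequence levels.
# The D level is a counter within I and has no cut-off of its own.
IDENTITY_CUTOFFS = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code: str) -> dict[str, int]:
    """Map each of the nine level letters to its number (hypothetical helper)."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def superfamily_of(code: str) -> str:
    """Return the C.A.T.H prefix shared by all members of a superfamily."""
    return ".".join(code.split(".")[:4])
```

For a made-up code such as `3.40.50.200.1.1.1.1.1`, `superfamily_of` would return `3.40.50.200`, and two domains sharing the first five numbers would sit in the same 35% identity cluster.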

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
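
The automatic-chopping decision can be sketched as a simple threshold check. Only the four criteria named in the text are encoded here; the text ends its list with "etc.", so the real protocol applies further conditions that this sketch omits. The class `InheritedChop` and the function `passes_autochop` are hypothetical names, not part of the CATH code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InheritedChop:
    """Quality measures for boundaries inherited from one chopped chain."""
    ssap_score: float           # SSAP structural similarity score
    sequence_identity: float    # percent identity to the chopped chain
    rmsd: float                 # RMSD of the superposition, in angstroms
    longest_end_extension: int  # residues beyond the inherited boundaries


def passes_autochop(result: InheritedChop) -> bool:
    """Apply the four published thresholds (a partial check; others are assumed to exist)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80   # strictly greater, as in the text
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )
```

A result that fails any one threshold would fall back to manual DomChop, with the inherited boundaries offered as supporting information.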

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing for both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
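
The two acceptance rules in this paragraph can be written down as simple filters. This is a sketch under stated assumptions: residue overlap is taken to mean the fraction of the CATH domain (or reference sequence) covered by the aligned region, the thresholds are read as inclusive minimums except for the E-value, and the function names are hypothetical.

```python
def residue_overlap(aligned_residues: int, reference_length: int) -> float:
    """Fraction of the reference covered by the alignment (assumed definition)."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return aligned_residues / reference_length


def is_homologous_hit(e_value: float, aligned_residues: int, domain_length: int,
                      max_e_value: float = 0.01, min_overlap: float = 0.60) -> bool:
    """HMM hit accepted as a sequence relative of a CATH domain."""
    return (e_value < max_e_value
            and residue_overlap(aligned_residues, domain_length) >= min_overlap)


def can_transfer_annotation(identity_pct: float, aligned_residues: int,
                            reference_length: int, min_identity: float = 95.0,
                            min_overlap: float = 0.80) -> bool:
    """BLAST hit close enough to inherit a functional annotation."""
    return (identity_pct >= min_identity
            and residue_overlap(aligned_residues, reference_length) >= min_overlap)
```

Keeping the thresholds as keyword arguments makes it easy to see how the much stricter annotation rule differs from the rule used to harvest relatives.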

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
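
These score bands are rules of thumb rather than hard cut-offs, but they can be summarised in a small lookup. The sketch below assumes SSAP scores lie between 0 and 100; the function name `interpret_ssap` and the wording of the returned labels are illustrative choices, not CATH terminology.

```python
def interpret_ssap(score: float) -> str:
    """Give a rough reading of an SSAP score using the bands quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are assumed to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # May reflect convergence if no sequence or functional similarity exists.
        return "similar fold"
    return "no clear structural relationship"
```

A score in the 70 to 80 band should therefore be read together with sequence and functional evidence before any evolutionary relationship is claimed.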

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
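
The grouping rule in this paragraph amounts to a two-branch test, sketched below. What counts as "high structural similarity" is not quantified at this point in the text, so it is passed in as a yes/no flag; the function name `same_homologous_family` is a hypothetical label for this illustration.

```python
def same_homologous_family(identity_pct: float,
                           high_structural_similarity: bool) -> bool:
    """Apply the two alternative criteria for grouping proteins into a family."""
    if identity_pct >= 35:
        # Significant sequence similarity is sufficient on its own.
        return True
    # Otherwise require structural support plus some sequence similarity.
    return high_structural_similarity and identity_pct >= 20
```

Under this reading, a pair at 25% identity is grouped only when the structures are also highly similar, while a pair at 40% identity is grouped regardless.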

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.

Therefore a number of structural genomics initiatives are being proposed (14), which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
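The percentages quoted for the 443 structures with new sequences can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the number of novel folds is not stated explicitly and is derived here as the remainder, which is an assumption of this example.

```python
# Counts reported in the text for the 443 structures with 'new' sequences.
total_new = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: everything not listed above is a novel fold (derived, not quoted).
counts["novel folds"] = total_new - sum(counts.values())


def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


shares = {name: percent(n, total_new) for name, n in counts.items()}

# Clear plus probable homologues: the share assignable through structure.
homologue_share = percent(
    counts["clear homologues"] + counts["probable homologues"], total_new
)
```

The derived shares agree with the rounded figures in the text, and the combined homologue share matches the 83% quoted later for structures with novel sequences that could be assigned as homologues.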


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.

For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins in order to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
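The distance criterion above can be sketched in a few lines. This is an illustrative brute-force version under stated assumptions, not JAIL's actual implementation: atoms are given as hypothetical (residue id, coordinates) pairs, the cutoff is taken as 4.5 Å, and the size requirement is read as at least five residues (one C-alpha each) on each side.

```python
import math

CUTOFF = 4.5  # Angstrom; cutoff taken from the definition in the text


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return the residues of each chain that have an atom within `cutoff`
    of any atom of the other chain.

    Each chain is a list of (residue_id, (x, y, z)) tuples, one per atom.
    A residue counts as a whole as soon as one of its atoms is in range.
    """
    res_a, res_b = set(), set()
    for rid_a, xyz_a in chain_a:
        for rid_b, xyz_b in chain_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                res_a.add(rid_a)
                res_b.add(rid_b)
    return res_a, res_b


def is_valid_interface(res_a, res_b, min_residues=5):
    """Assumed reading of the size rule: each side needs >= 5 residues."""
    return len(res_a) >= min_residues and len(res_b) >= min_residues
```

The double loop is quadratic in the number of atoms, which is acceptable for a single pair of chains; a large-scale calculation would normally use a spatial index instead.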

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None

A full-text keyword search and a SCOP text search are also available. Results can be restricted to interfaces having a particular location: intra, inter or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
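Besides the web page, Entrez databases can also be queried programmatically through NCBI's E-utilities. The sketch below only builds a search URL for the structure database and performs no network access; the function name is hypothetical, and the assumption is that the ESearch endpoint is called with its usual `db`, `term` and `retmax` parameters.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that looks up `term` in the Entrez structure
    database (MMDB) and asks for at most `retmax` record identifiers."""
    query = urlencode({"db": "structure", "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"
```

Fetching the resulting URL returns a list of matching record identifiers, which can then be followed up with further E-utilities calls or opened in the Entrez web interface.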


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.



The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


ProDom Protein domain database

Prints Protein motif fingerprint database

Prosite Database of protein families and domains

SMART Simple modular architecture research tool

TIGRFAM Protein families based on HMMs

Others

Phospho Site Database of phosphorylation sites

PROW Protein reviews on the web

Protein Lounge Complete systems biology

Other Databases

Carbohydrate Databases

Carb DB Carbohydrate Sequence and Structure Database

GlycoWord Glycoscience related information

SPECARB Raman Spectra of carbohydrates

Other Databases

AlzGene Alzheimer's disease

Polygenic pathways Alzheimer's disease, bipolar disorder or schizophrenia

Model Organism Databases and Resources

Arabidopsis - Bacteria - Sea Bass - Cat - Cattle - Chicken - Cotton - Cyanobacteria - Daphnia - Deer - Dictyostelium - Dog - Frog - Fruit Fly - Fungus - Goat - Horse - Medaka Fish - Maize - Malaria - Mosquito - Mouse - Pig - Plants - Protozoa - Puffer Fish - Rat - Rice - Rickettsia - Salmon - Sheep - Soy - Sorghum - Tetraodon - Tilapia - Turkey - Viruses - Worm - Yeast - Zebra Fish

General Information

GMOD Generic Model Organism Database

Model Organisms The WWW virtual library of model organisms

Arabidopsis thaliana

ABRC Arabidopsis biological resource center

AGI Arabidopsis genome initiative

AREX Arabidopsis gene expression database

Arabinet Arabidopsis information on the www

AtGDB An Arabidopsis thaliana plant genome database

AtGI TIGR Arabidopsis thaliana gene index

ATGC Genome sequencing at ATGC


ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germ plasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated Mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)


Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyanobacteria (blue-green algae)

Cyanobacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)


Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics


Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey


Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain often also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

CDD

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application anda script interface for a conserved domain searchon multiple protein sequences accepting up to 100000proteins in a single job It enables you to view a graphicaldisplay of the concise or full search result for any individual

protein from your input list or todownload the results forthe complete set of proteins The Batch CD-SearchHelp provides additional details

CD-Search (Help amp FTP) Batch CD-Search (Help) Publications

CDART: Domain Architectures

The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu, and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor. It is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries.

CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI-curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14,831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Pfam entries are classified in one of four ways:

Family: A collection of related protein regions.

Domain: A structural unit.

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Motif: A short unit found outside globular domains.

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.

Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches.

View a Pfam family: View Pfam family annotation and alignments.

View a clan: See groups of related families.

View a sequence: Look at the domain organisation of a protein sequence.

View a structure: Find the domains on a PDB structure.

Keyword search: Query Pfam by keywords.

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

The Pfam website lets you browse lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to the current release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
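To make the two gathering cutoffs concrete, here is a minimal Python sketch. It assumes the usual reading that a domain match counts only when the whole-sequence score reaches the sequence cutoff and the individual domain score reaches the domain cutoff. The names `DomainHit` and `hits_passing_ga` are hypothetical and are not part of Pfam or HMMER.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DomainHit:
    """One domain match of a profile HMM on a sequence (hypothetical record)."""
    start: int
    end: int
    score: float  # domain bit score


def hits_passing_ga(sequence_score: float,
                    domain_hits: List[DomainHit],
                    ga_sequence: float,
                    ga_domain: float) -> List[DomainHit]:
    """Return the domain hits that clear both gathering cutoffs.

    Assumption: the sequence score must reach the sequence cutoff first;
    each surviving domain must then reach the domain cutoff.
    """
    if sequence_score < ga_sequence:
        return []
    return [hit for hit in domain_hits if hit.score >= ga_domain]


# Illustrative numbers only, not taken from any real Pfam entry.
hits = [DomainHit(10, 95, 32.5), DomainHit(120, 200, 18.0)]
kept = hits_passing_ga(sequence_score=50.5, domain_hits=hits,
                       ga_sequence=25.0, ga_domain=22.0)
print([(h.start, h.end) for h in kept])
```

With these example values only the first domain is kept, because the second domain scores below the domain cutoff even though the sequence as a whole passes.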

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
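The trusted cutoff (TC) and noise cutoff (NC) both follow from where the gathering threshold falls in the list of match scores. The sketch below illustrates that relationship under the definitions in this glossary. The function name `trusted_and_noise_cutoffs` and the example scores are hypothetical, and real Pfam entries record separate sequence and domain values for each cutoff, which this sketch ignores.

```python
from typing import Optional, Sequence, Tuple


def trusted_and_noise_cutoffs(
    scores: Sequence[float], gathering_threshold: float
) -> Tuple[Optional[float], Optional[float]]:
    """Derive (TC, NC) from match bit scores and a gathering threshold.

    TC: lowest score among matches that are in the full alignment.
    NC: highest score among matches that are not in the full alignment.
    Either value is None when the corresponding group is empty.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Illustrative scores only.
tc, nc = trusted_and_noise_cutoffs([55.2, 31.0, 27.4, 22.1, 18.9], 25.0)
print(tc, nc)  # 27.4 22.1
```

By construction the gathering threshold always lies between NC and TC, so the gap between the two gives a rough idea of how cleanly the threshold separates members from non-members.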

SMART

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used on the SMART website to make it easy to identify the current mode (Normal or Genomic).

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
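The definition above is a base-2 log-likelihood ratio, which can be written out directly. This is a minimal sketch with hypothetical function names and made-up likelihood values. HMMer and BLAST compute the underlying likelihoods in their own ways, which are not reproduced here.

```python
import math


def bit_score(likelihood_homologue: float, likelihood_random: float) -> float:
    """Bits score = log2 of (homology-model likelihood / random-model likelihood)."""
    return math.log2(likelihood_homologue / likelihood_random)


def odds_from_bits(bits: float) -> float:
    """Invert the definition: how many times more likely the homology model is."""
    return 2.0 ** bits


# Made-up likelihoods: the homology model explains the sequence
# eight times better than the random model, i.e. 3 bits.
print(round(bit_score(8e-10, 1e-10), 6))
print(odds_from_bits(10.0))  # 1024.0
```

Each additional bit doubles the odds in favour of homology, so a score of 10 bits corresponds to odds of 1024 to 1 and a negative score means the random model fits better.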

BLAST (Basic local alignment search tool)

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query, in the same order (additional domains are allowed).
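The difference between domain composition and domain organisation is easy to see in code. This minimal sketch treats a protein as a list of domain names in N- to C-terminal order. The function names are hypothetical, and reading "same order, additional domains allowed" as an ordered-subsequence test is an assumption, not SMART's documented implementation.

```python
from typing import Sequence


def same_composition(query: Sequence[str], target: Sequence[str]) -> bool:
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)


def same_organisation(query: Sequence[str], target: Sequence[str]) -> bool:
    """True if the query domains occur in the target in the same order.

    Extra domains in the target are allowed (ordered-subsequence test):
    each membership check consumes the iterator up to the match, so later
    query domains can only match further along the target.
    """
    remaining = iter(target)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["SH2", "SH3", "TyrKc"]))         # True
print(same_organisation(query, ["SH2", "SH3", "TyrKc"]))        # False
print(same_organisation(query, ["SH3", "SH2", "C1", "TyrKc"]))  # True
```

Every protein that matches the organisation test also matches the composition test, but not the other way round, which is why an organisation query returns a narrower result set.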

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
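An E-value is an expected count, while a P-value is a probability, and the two are often converted into each other. The sketch below uses the common assumption, standard in BLAST-style statistics but not stated in these notes, that the number of chance hits scoring at least X follows a Poisson distribution with mean E. The function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits.

    P = 1 - exp(-E). math.expm1 keeps precision when E is very small.
    """
    if e_value < 0:
        raise ValueError("an E-value cannot be negative")
    return -math.expm1(-e_value)


for e in (1e-6, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two numbers are nearly identical (an E-value of 0.01 gives a P-value just under 0.01), whereas large E-values can exceed 1 while the P-value approaches but never reaches 1.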

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are required by alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM (Hidden Markov model)

HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (PubMed ID 8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
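The consensus rule described above can be sketched in a few lines. The example assumes each match position of the HMM is available as a dictionary of amino acid emission probabilities. That input format and the function name `hmm_consensus` are illustrative assumptions, not the HMMer file format or API.

```python
from typing import Dict, List


def hmm_consensus(match_emissions: List[Dict[str, float]],
                  threshold: float = 0.5) -> str:
    """Build a one-line consensus from per-position emission probabilities.

    At each position the most probable residue is shown, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    letters = []
    for emissions in match_emissions:
        residue, probability = max(emissions.items(), key=lambda item: item[1])
        letters.append(residue.upper() if probability > threshold
                       else residue.lower())
    return "".join(letters)


# Made-up emission probabilities for a three-position model.
model = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.9, "R": 0.1},
]
print(hmm_consensus(model))  # AlK
```

In this example the middle position is shown in lower case because no residue reaches a probability above 0.5 there, signalling weaker conservation.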

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB (non-redundant database)

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART's Normal mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST.

Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-BLAST profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Centre and Washington University). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database (domain sequence database)

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives complementary information to the profile searches.

Searched domains

In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 this set was extended with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic) domains, only extracellular domains, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
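The selection rule in this procedure is a simple counting step, sketched below. The sketch assumes the Medline neighbour lists have already been retrieved and are passed in as a dictionary. The function name and the paper identifiers are hypothetical, and the retrieval itself is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, List


def secondary_literature(neighbours_by_paper: Dict[str, Iterable[str]],
                         min_sources: int = 3) -> List[str]:
    """Select neighbouring papers shared by several hand-selected papers.

    "More than two original papers" is read as at least three, hence the
    default of min_sources=3. Each original paper counts once per neighbour.
    """
    support = Counter()
    for neighbours in neighbours_by_paper.values():
        support.update(set(neighbours))  # de-duplicate within one list
    return sorted(paper for paper, count in support.items()
                  if count >= min_sources)


# Hypothetical identifiers: keys are hand-selected papers, values are
# their retrieved neighbours (only a few shown instead of 100).
neighbours = {
    "paperA": ["n1", "n2", "n3"],
    "paperB": ["n1", "n3", "n4"],
    "paperC": ["n1", "n2", "n5"],
}
print(secondary_literature(neighbours))  # ['n1']
```

Only the neighbour shared by all three hand-selected papers is kept here, while neighbours supported by two papers fall just short of the rule.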

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either UniProt or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for UniProt and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses UniProt as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
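Two steps of the redundancy-lowering procedure, collapsing identical proteins and choosing the longest cluster member as representative, can be sketched as follows. The clustering itself is done by the external CD-HIT program and is not reimplemented here: the sketch assumes its per-species clusters are supplied as lists of protein IDs. All function names and sequences are hypothetical.

```python
from typing import Dict, List


def collapse_identical(sequences: Dict[str, str]) -> Dict[str, List[str]]:
    """Keep one ID per distinct sequence, remembering the other IDs as aliases.

    Returns a mapping of kept ID -> list of alias IDs with the same sequence.
    """
    ids_by_sequence: Dict[str, List[str]] = {}
    for protein_id, sequence in sequences.items():
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return {ids[0]: ids[1:] for ids in ids_by_sequence.values()}


def pick_representatives(clusters: List[List[str]],
                         sequences: Dict[str, str]) -> List[str]:
    """Choose the longest member of each cluster as its representative."""
    return [max(cluster, key=lambda pid: len(sequences[pid]))
            for cluster in clusters]


# Toy data for a single species; P1 and P2 are identical.
seqs = {"P1": "MKT", "P2": "MKT", "P3": "MKTAYIAK", "P4": "MKTAYI"}
print(collapse_identical(seqs))
# Clusters as an external tool such as CD-HIT might report them (assumed).
print(pick_representatives([["P3", "P4"], ["P1"]], seqs))  # ['P3', 'P1']
```

Single proteins that belong to no cluster can be passed as one-member clusters, in which case they are their own representatives, matching the last step of the procedure.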

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
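The last-common-ancestor step can be illustrated with a short sketch. It assumes each organism is given as a root-to-leaf lineage of taxon names. The lineages below are simplified for illustration and are not the full NCBI taxonomy, and the function name is hypothetical.

```python
from typing import List, Optional


def last_common_ancestor(lineages: List[List[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all root-to-leaf lineages.

    Walks the lineages rank by rank from the root and stops at the first
    level where they disagree. Returns None if nothing is shared.
    """
    if not lineages:
        return None
    ancestor = None
    for taxa_at_level in zip(*lineages):
        if all(taxon == taxa_at_level[0] for taxon in taxa_at_level):
            ancestor = taxa_at_level[0]
        else:
            break
    return ancestor


# Simplified lineages of organisms sharing one domain architecture.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata",
         "Mammalia", "Homo sapiens"]
mouse = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata",
         "Mammalia", "Mus musculus"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda",
       "Insecta", "Drosophila melanogaster"]

print(last_common_ancestor([human, mouse]))       # Mammalia
print(last_common_ancestor([human, mouse, fly]))  # Metazoa
```

Adding a more distantly related organism can only move the inferred origin towards the root, so the estimate depends on which genomes are present in the database.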

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature in PubMed.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.


The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
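A reciprocal best match between two genomes can be sketched as follows. This is only an illustration of the general idea: the function names, the identifiers and the similarity scores are hypothetical, and the source does not say which scoring SMART uses to rank hits.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query_id, subject_id) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores for proteins of genome A searched against genome B
# and vice versa.
a_vs_b = {("A1", "B1"): 200, ("A1", "B2"): 150, ("A2", "B2"): 90, ("A3", "B1"): 120}
b_vs_a = {("B1", "A1"): 198, ("B2", "A1"): 160, ("B2", "A2"): 80}

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # prints [('A1', 'B1')]
```

Here A2 and A3 each have a best hit in genome B, but those hits prefer A1, so only the A1/B1 pair is reciprocal.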

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
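Adjusting feature positions for gaps amounts to translating coordinates on the ungapped sequence into alignment columns. The sketch below shows one way to do this; the function names and the toy aligned row are assumptions for illustration and are not taken from SMART.

```python
def residue_columns(aligned_row, gap="-"):
    """Return the 0-based alignment column of every residue in an aligned row."""
    return [col for col, char in enumerate(aligned_row) if char != gap]


def map_feature(aligned_row, start, end):
    """Map a 1-based, inclusive feature (start, end) on the ungapped sequence
    to 1-based, inclusive alignment columns."""
    columns = residue_columns(aligned_row)
    return columns[start - 1] + 1, columns[end - 1] + 1


# Toy aligned row: residues M K L V T A with three gap characters inserted.
row = "MK--LV-TA"
print(map_feature(row, 2, 4))  # residues K..V -> prints (2, 6)
```

A feature that spans a gapped region simply becomes wider in alignment coordinates, which is why the mapped domain in the example covers five columns for three residues.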

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.
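A client that respects these limits has to split its input before submitting anything. The sketch below only does the bookkeeping; it deliberately contains no network code, because the source does not describe the batch interface itself. The function name `plan_batches` and the identifiers are hypothetical.

```python
def plan_batches(sequence_ids, per_access=500, per_day=2000):
    """Split identifiers into batches that respect the stated limits.

    Returns (batches_for_today, deferred_ids), where every batch holds at
    most `per_access` identifiers and today's batches together hold at most
    `per_day` identifiers.
    """
    today = sequence_ids[:per_day]
    deferred = sequence_ids[per_day:]
    batches = [today[i:i + per_access] for i in range(0, len(today), per_access)]
    return batches, deferred


# Hypothetical input of 2300 identifiers.
ids = [f"seq{i}" for i in range(2300)]
batches, deferred = plan_batches(ids)
print(len(batches), len(deferred))  # prints 4 300
```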

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure-based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).


Improved Architecture analysis

In addition to standard 'Domain selection' querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.


Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ('tax break') is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Selective SMART allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
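The behaviour described here (repeated terms mean repeated copies, and names are compared without regard to case) can be modelled with a multiset comparison. This is a sketch of the semantics only, assuming terms are joined by the literal operator ` AND `; the function names are hypothetical and this is not SMART's query parser.

```python
from collections import Counter


def parse_query(query):
    """Turn 'Sh3 AND sH3 AND TRANS' into required copy counts per
    lower-cased domain or feature name."""
    terms = [term.strip().lower() for term in query.split(" AND ")]
    return Counter(term for term in terms if term)


def matches(architecture, query):
    """True if the protein's domain list contains at least as many copies
    of each queried name as the query asks for (case-insensitive)."""
    have = Counter(name.lower() for name in architecture)
    need = parse_query(query)
    return all(have[name] >= count for name, count in need.items())


print(matches(["SH3", "SH3", "SH3", "TyrKc"], "Sh3 AND sH3 AND sh3"))  # prints True
print(matches(["SH3", "SH2"], "Sh3 AND sH3 AND sh3"))                  # prints False
print(matches(["SIGNAL", "TyrKc"], "SIGNAL AND TyrKc"))                # prints True
```

A `Counter` is a convenient fit because a missing name counts as zero, so a protein lacking a queried domain fails the test without any special handling.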

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to come from structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
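The inclusion criteria and the identity thresholds of the sequence levels lend themselves to a compact encoding. The sketch below is an assumption-laden illustration: the class and function names are hypothetical, the thresholds are read from the text as 4 Å, 40 residues and 70% of side chains, and the handling of NMR structures (which have no resolution) is left out.

```python
from dataclasses import dataclass

# Sequence identity thresholds (%) for the S, O, L and I levels as listed
# in the text; the D level is a counter rather than a threshold.
SOLI_IDENTITY_PERCENT = {"S": 35, "O": 60, "L": 95, "I": 100}


@dataclass
class Structure:
    resolution_angstrom: float            # lower is better
    n_residues: int
    fraction_side_chains_resolved: float  # between 0.0 and 1.0


def meets_cath_criteria(structure):
    """True if the structure passes the three inclusion criteria quoted above."""
    return (
        structure.resolution_angstrom <= 4.0
        and structure.n_residues >= 40
        and structure.fraction_side_chains_resolved >= 0.70
    )


print(meets_cath_criteria(Structure(2.1, 159, 0.95)))  # prints True
print(meets_cath_criteria(Structure(4.5, 159, 0.95)))  # prints False
```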


Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
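The automatic-chopping decision is essentially a conjunction of thresholds, which can be sketched as below. The function name and parameter names are hypothetical, the RMSD cut-off is read as 6.0 Å, and the text ends its list with 'etc', so the real protocol applies further criteria that are not modelled here.

```python
def passes_autochop(ssap_score, seq_identity_percent, rmsd_angstrom,
                    longest_end_extension):
    """True if an inherited chopping meets the four criteria listed in the text.

    Only the criteria named in the text are checked; the actual protocol
    uses additional, unlisted ones.
    """
    return (
        ssap_score >= 80
        and seq_identity_percent > 80
        and rmsd_angstrom <= 6.0
        and longest_end_extension <= 10
    )


# Hypothetical inheritance results for two query chains.
print(passes_autochop(92.5, 97.0, 1.2, 3))   # prints True  -> chop automatically
print(passes_autochop(84.0, 80.0, 2.5, 4))   # prints False -> pass to manual curation
```

Note the strict inequality on sequence identity: a value of exactly 80% fails, whereas an SSAP score of exactly 80 passes.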


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising since the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
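The hit-selection rule for sequence relatives can be written as a simple filter. This sketch makes several assumptions that the text does not settle: the thresholds are read as an E-value below 0.01 and an overlap of at least 60%, 'overlap' is taken to mean the fraction of the CATH domain covered by the match, and the function name and parameters are hypothetical.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """True if an HMM hit is significant and covers enough of the domain.

    Overlap is computed as aligned_residues / domain_length, which is an
    assumed definition for this illustration.
    """
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


# Hypothetical hits against a 150-residue CATH domain.
print(is_homologous_hit(1e-12, 120, 150))  # prints True  (80% overlap)
print(is_homologous_hit(1e-12, 60, 150))   # prints False (40% overlap)
print(is_homologous_hit(0.5, 140, 150))    # prints False (E-value too high)
```

Both cut-offs are keyword arguments, so the stricter annotation rule quoted in the same passage (95% identity, 80% overlap) would need its own function rather than a change of defaults, since it tests identity instead of an E-value.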

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred; in some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
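The score ranges quoted for SSAP can be turned into a rough lookup. The function name and the wording of the labels are assumptions for illustration; the text itself stresses that these ranges are general tendencies, so the result is a first guess and not a classification.

```python
def interpret_ssap(score):
    """Give a rough reading of an SSAP score using the ranges in the text."""
    if score >= 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (possibly convergent)"
    return "no clear structural similarity"


for value in (100, 85, 75, 60):
    print(value, interpret_ssap(value))
# prints:
# 100 identical
# 85 likely homologous
# 75 similar fold (possibly convergent)
# 60 no clear structural similarity
```

As the passage notes, a score in the 70 to 80 range still needs sequence or functional evidence before homology can be claimed.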

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
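The four-level hierarchy maps naturally onto the dotted CATH identifier, where each number names one level (class, architecture, topology, homologous family). The sketch below assumes that dotted four-part form and the usual class numbering (1 mainly-alpha, 2 mainly-beta, 3 alpha-beta, 4 few secondary structures). The helper names are hypothetical, and the second identifier in the example is made up for illustration.

```python
from typing import NamedTuple

# Class numbering assumed from the CATH convention; labels are illustrative.
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    class_: int             # C: secondary-structure composition
    architecture: int       # A: 3D arrangement of secondary structures
    topology: int           # T: fold group
    homologous_family: int  # H: evolutionary family


def parse_cath_id(text: str) -> CathId:
    """Parse a dotted CATH identifier such as '3.40.50.200'."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def share_level(a: CathId, b: CathId, depth: int) -> bool:
    """True if two domains agree on the first `depth` levels (1=C ... 4=H)."""
    return a[:depth] == b[:depth]


if __name__ == "__main__":
    first = parse_cath_id("3.40.50.200")
    second = parse_cath_id("3.40.50.300")  # hypothetical second family
    print(CLASS_NAMES[first.class_])       # class label of the first domain
    print(share_level(first, second, 3))   # same fold group?
    print(share_level(first, second, 4))   # same homologous family?
```

Comparing prefixes of the identifier is what makes the hierarchy convenient: two domains in the same fold group but different homologous families agree on the first three numbers only.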

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α-β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found that only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
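The percentages in this breakdown can be checked directly from the counts. In the sketch below the three counts (199, 169, 40) and the total (443) come from the text; the number of novel folds is not stated as a count, so it is inferred here by subtraction, which is an assumption.

```python
TOTAL_NEW_SEQUENCES = 443

# Counts quoted in the text for structures with 'new' sequences.
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: whatever is left over corresponds to the novel folds.
counts["novel folds"] = TOTAL_NEW_SEQUENCES - sum(counts.values())

percentages = {
    name: round(100 * n / TOTAL_NEW_SEQUENCES) for name, n in counts.items()
}

# Share assignable as homologues (clear + probable) from structure.
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"])
    / TOTAL_NEW_SEQUENCES
)

if __name__ == "__main__":
    for name, pct in percentages.items():
        print(f"{name}: {counts[name]} ({pct}%)")
    print(f"homologues overall: {homologue_share}%")
```

The rounded shares come out at 45, 38, 9 and 8 percent, and the combined homologue share is 83 percent.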


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
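The comparison of "the first three EC identifiers" amounts to a prefix comparison on the four-level EC number. The following is a minimal sketch of that test, assuming complete four-part EC numbers; the function names are hypothetical and the EC numbers in the example are chosen for illustration only, not taken from the analysis described above.

```python
from typing import Iterable, Tuple


def ec_levels(ec: str) -> Tuple[str, ...]:
    """Split an EC number such as '3.4.21.62' into its four levels."""
    parts = tuple(ec.strip().split("."))
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {ec!r}")
    return parts


def shared_ec_depth(ec_numbers: Iterable[str]) -> int:
    """Number of leading EC levels shared by every member of a family (0-4)."""
    levels = [ec_levels(ec) for ec in ec_numbers]
    depth = 0
    for column in zip(*levels):
        if len(set(column)) != 1:
            break
        depth += 1
    return depth


if __name__ == "__main__":
    family = ["3.4.21.62", "3.4.21.66"]  # illustrative EC numbers
    print(shared_ec_depth(family))       # depth 3 = same first three levels
```

A family with a shared depth of 4 has a single EC code; a depth of 3 means the members catalyse closely related reactions that differ only at the last level.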

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
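The distance rule for protein chains can be sketched in a few lines. This is an illustration of the definition, not JAIL's own code: the `Atom` record, the function names and the brute-force search are assumptions, the cutoff is taken as 4.5 Å, and the "at least 5 C-alpha atoms" condition is read here as "at least 5 residues on one side", since each amino acid contributes one C-alpha.

```python
import math
from typing import Iterable, NamedTuple, Set, Tuple


class Atom(NamedTuple):
    chain: str    # chain or domain label
    residue: int  # residue number within the chain
    x: float
    y: float
    z: float


ResidueKey = Tuple[str, int]


def interface_residues(
    side: Iterable[Atom], partner: Iterable[Atom], cutoff: float = 4.5
) -> Set[ResidueKey]:
    """Residues of `side` with at least one atom within `cutoff` of `partner`."""
    partner_coords = [(a.x, a.y, a.z) for a in partner]
    found: Set[ResidueKey] = set()
    for atom in side:
        key = (atom.chain, atom.residue)
        if key in found:
            continue  # one close atom is enough to include the whole residue
        here = (atom.x, atom.y, atom.z)
        if any(math.dist(here, other) <= cutoff for other in partner_coords):
            found.add(key)
    return found


def is_valid_part(residues: Set[ResidueKey], minimum: int = 5) -> bool:
    """Assumed reading of the size rule: one side needs >= 5 residues."""
    return len(residues) >= minimum


if __name__ == "__main__":
    chain_a = [Atom("A", i, float(i), 0.0, 0.0) for i in range(1, 7)]
    chain_b = [Atom("B", 1, 0.0, 3.0, 0.0)]
    hits = interface_residues(chain_a, chain_b)
    print(sorted(hits), is_valid_part(hits))
```

The all-pairs search is quadratic in the number of atoms, which is fine for a single pair of chains; a spatial grid or k-d tree would be the usual choice for database-scale runs.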

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

The search form accepts the following identifiers:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

The search can be restricted to the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

A full-text keyword search and a SCOP text search are also available. Results can be limited by interface location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated (by Hai Fang).

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies (by Christine Vogel).

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments at the SCOP superfamily level for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family-level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


ATIDB Arabidopsis insertion database

CSHL Arabidopsis genome analysis at Cold Spring

ESSA Arabidopsis thaliana project at MIPS

Genoscope AGI in France

Kazusa Arabidopsis thaliana genome info Japan

MPSS Massively parallel signature sequencing

NASC Nottingham Arabidopsis stock center

Stanford Sequencing of the Arabidopsis genome at Stanford

TAIR Arabidopsis information resource

TIGR TIGR Arabidopsis genome annotation database

Wustl Arabidopsis genome at Washington university

Trees A forest tree genome database

Bacterial genomes

B. subtilis Bacillus subtilis database

Chlamydomonas Chlamydomonas genetics center

E. coli E. coli genome project

MGD Microbial germ plasm database

Microbial Microbial Genome Gateway

Microbial Microbial genomes

Micado Genetic maps of B. subtilis and E. coli

MycDB An integrated Mycobacterial database

Neisseria Neisseria meningitidis genome

Neurospora Neurospora crassa database

OralGen Oral pathogen database

Salmonella Salmonella information

STDGen Sexually transmitted disease database

Bass

Bass Sea Bass Mapping project

Cat (Felis catus)

Cat ArkDB Cat mapping database

Cattle (Bos taurus)

ARK Farm animals

BoLA Bovine MHC information

Bovin Bovine genome database

BovMap Mapping the bovine genome

CaDBase Genetic diversity in cattle

ComRad Comparative radiation hybrid mapping

Cow ArkDB Bovine ArkDB

GemQual Genetics of meat quality

Chicken (Gallus gallus)


Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyano Bacteria (Blue green algae)

Cyano Bacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)


Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics


Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey

Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The Genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain also carries specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.

The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search and Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
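To make the idea of scanning a query with a position-specific scoring matrix concrete, the following Python sketch scores every ungapped window of a short sequence against a toy three-column matrix. It is an illustration only: the matrix values, the default score and the function names are invented here, and RPS-BLAST itself does considerably more (gapped alignment and significance statistics, for example) than this sketch attempts.

```python
# Toy PSSM: one dict of residue scores per alignment column.
# All numbers are invented for illustration, not taken from a CDD model.
TOY_PSSM = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"T": 4, "S": 3},
]
DEFAULT_SCORE = -2  # assumed penalty for residues a column does not list


def score_window(window, pssm):
    """Sum the column scores for a window that is as long as the PSSM."""
    return sum(column.get(residue, DEFAULT_SCORE)
               for residue, column in zip(window, pssm))


def best_hit(sequence, pssm):
    """Return (score, start) of the best ungapped match of the PSSM."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    starts = range(len(sequence) - width + 1)
    # max() keeps the first start position when several windows tie.
    start = max(starts, key=lambda i: score_window(sequence[i:i + width], pssm))
    return score_window(sequence[start:start + width], pssm), start


print(best_hit("MAGKTLL", TOY_PSSM))
```

For the query MAGKTLL the best window is GKT at index 2, because each of its residues is the top-scoring residue in its column of the toy matrix.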

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.

CDART Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization

Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources

NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14,831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.

Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches

View a Pfam family: View Pfam family annotation and alignments

View a clan: See groups of related families

View a sequence: Look at the domain organisation of a protein sequence

View a structure: Find the domains on a PDB structure

Keyword search: Query Pfam by keywords

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
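As a small illustration of how the two gathering cutoffs could be applied, the Python sketch below assumes a simple rule: a sequence is accepted when its sequence score reaches the sequence cutoff and at least one of its domains reaches the domain cutoff. That rule, the class and the function names are assumptions made for this example and are not taken from the Pfam code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GatheringThreshold:
    """Hypothetical container for the two GA cutoffs of one Pfam HMM."""
    sequence_cutoff: float
    domain_cutoff: float


def passing_domains(domain_scores: List[float], ga: GatheringThreshold) -> List[float]:
    """Keep only the domain scores that reach the domain cutoff."""
    return [score for score in domain_scores if score >= ga.domain_cutoff]


def in_full_alignment(sequence_score: float,
                      domain_scores: List[float],
                      ga: GatheringThreshold) -> bool:
    """Assumed rule: the sequence cutoff is met and at least one domain passes."""
    if sequence_score < ga.sequence_cutoff:
        return False
    return len(passing_domains(domain_scores, ga)) > 0


ga = GatheringThreshold(sequence_cutoff=25.0, domain_cutoff=20.0)
print(in_full_alignment(30.0, [22.0, 8.0], ga))
print(in_full_alignment(24.0, [24.0], ga))
```

The second call fails on the sequence cutoff even though its single domain would pass the domain cutoff, which is the reason for checking the two values separately.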

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit score of the highest-scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with 10 being the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest-scoring match in the full alignment
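The trusted and noise cutoffs follow directly from the gathering threshold once the match scores are known. The Python sketch below shows one way to derive them; the function name and the example scores are invented for illustration, and the comparison "score >= gathering threshold" is an assumption about how membership of the full alignment is decided.

```python
from typing import Iterable, Optional, Tuple


def trusted_and_noise_cutoffs(
    bit_scores: Iterable[float], gathering_threshold: float
) -> Tuple[Optional[float], Optional[float]]:
    """Return (TC, NC) for a set of match scores and a gathering threshold.

    TC is the lowest score among matches that reach the threshold,
    NC is the highest score among matches that fall below it.
    None is returned for a cutoff when the corresponding group is empty.
    """
    scores = list(bit_scores)
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    trusted = min(included) if included else None
    noise = max(excluded) if excluded else None
    return trusted, noise


print(trusted_and_noise_cutoffs([50.2, 31.0, 27.5, 24.9, 12.0], 25.0))
```

With a gathering threshold of 25.0, the example scores give a trusted cutoff of 27.5 and a noise cutoff of 24.9, so the threshold always lies between the two values.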

SMART

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used to make it easy to identify the current mode (Normal mode or Genomic mode).

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
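The definition above translates into a one-line calculation. The Python sketch below computes a bits score from two likelihoods, and also from their natural logarithms, which is the safer form when the likelihoods are very small. The function names and the example numbers are illustrative assumptions, not values produced by HMMer or BLAST.

```python
import math


def bits_score(likelihood_homologue: float, likelihood_random: float) -> float:
    """Base-2 logarithm of the likelihood ratio (homologue model vs random model)."""
    if likelihood_homologue <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homologue / likelihood_random)


def bits_score_from_logs(ln_homologue: float, ln_random: float) -> float:
    """Same score computed from natural-log likelihoods to avoid underflow."""
    return (ln_homologue - ln_random) / math.log(2)


# A sequence that is 8 times more likely under the homologue model scores 3 bits.
print(round(bits_score(0.008, 0.001), 6))
```

Each additional bit therefore means the homologue model explains the sequence twice as well relative to the random model, and a score of 0 means the two models are equally likely.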

BLAST (Basic Local Alignment Search Tool)

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
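The difference between the two definitions is easiest to see on a small example. The Python sketch below treats a protein as an ordered list of domain names and reads "same order, additional domains allowed" as a subsequence test. That reading, the function names and the example architectures are assumptions made for illustration and do not describe SMART's actual query engine.

```python
from typing import Sequence


def same_domain_composition(query: Sequence[str], target: Sequence[str]) -> bool:
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)


def same_domain_organisation(query: Sequence[str], target: Sequence[str]) -> bool:
    """True if the query domains occur in the target in the same order.

    Extra domains in the target are allowed, so this is a subsequence test:
    each membership check consumes the iterator up to the matching domain.
    """
    remaining = iter(target)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_domain_organisation(query, ["SH3", "SH2", "PH", "TyrKc"]))
print(same_domain_organisation(query, ["SH2", "SH3", "TyrKc"]))
```

The second target has the same composition as the query but a different organisation, because SH2 and SH3 appear in the opposite order.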

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM (Hidden Markov model)

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
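The consensus rule described above can be sketched in a few lines of Python. The example assumes each model position is given as a dictionary of emission probabilities; that data layout, the function name and the probabilities themselves are invented for illustration and are not HMMer's file format.

```python
from typing import Dict, List


def hmm_consensus(columns: List[Dict[str, float]], threshold: float = 0.5) -> str:
    """Build a one-line consensus from per-position emission probabilities.

    The most probable residue is written in upper case when its probability
    exceeds the threshold, and in lower case otherwise.
    """
    letters = []
    for probabilities in columns:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        letters.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(letters)


toy_model = [
    {"G": 0.90, "A": 0.10},
    {"K": 0.40, "R": 0.35, "Q": 0.25},
    {"T": 0.60, "S": 0.40},
]
print(hmm_consensus(toy_model))
```

In the toy model the middle position is dominated by K, but only with probability 0.40, so it is reported in lower case while the two well-conserved positions are capitalised.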

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB (non-redundant database)

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
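As a rough guide to how a P-value relates to the E-value defined earlier in this glossary, BLAST statistics are commonly described with a Poisson approximation in which the probability of at least one chance hit is P = 1 - exp(-E). The Python sketch below assumes that approximation; the function name is illustrative and is not part of BLAST or SMART.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming a Poisson model.

    Uses expm1 so that very small E-values are not lost to rounding.
    """
    if e_value < 0:
        raise ValueError("E-value cannot be negative")
    return -math.expm1(-e_value)


for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

Under this approximation the two measures are nearly identical for small values (E below about 0.01), while for large E-values the P-value saturates towards 1.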

PDB (Protein Data Bank)

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database (domain sequence database)

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998, this set was extended with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
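The selection rule in this procedure amounts to counting how many original papers share each neighbour. The Python sketch below shows that counting step only; retrieving the 100 neighbours from Medline is not attempted, and the function name, the input layout and the paper identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def secondary_literature(
    neighbours_by_original: Dict[Hashable, List[Hashable]],
    more_than: int = 2,
) -> List[Hashable]:
    """Return neighbour papers linked to more than `more_than` original papers.

    Each original paper counts once per neighbour, even if the neighbour
    appears several times in its list.
    """
    support = Counter()
    for neighbours in neighbours_by_original.values():
        support.update(set(neighbours))
    return sorted(paper for paper, count in support.items() if count > more_than)


# Hypothetical identifiers: keys are hand-selected papers, values their neighbours.
toy_neighbours = {
    "A": [1, 2, 3],
    "B": [2, 3],
    "C": [3, 2, 9],
    "D": [3],
}
print(secondary_literature(toy_neighbours))
```

In the toy input, papers 2 and 3 are neighbours of three and four original papers respectively, so both enter the secondary list, while papers 1 and 9 do not.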

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
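The bookkeeping around the redundancy-reduction steps above can be sketched in a few lines of Python. This is only an illustration, not SMART's actual code: the function names and input layouts are hypothetical, identical sequences are assumed to be collapsed within a species, and the clustering itself (done by CD-HIT in the text) is assumed to have been run already and parsed into a dictionary.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group the IDs of 100% identical sequences, separately for each species.

    records: iterable of (protein_id, species, sequence) tuples.
    Returns {(species, sequence): [protein_id, ...]} so every ID stays available.
    """
    groups = defaultdict(list)
    for protein_id, species, sequence in records:
        groups[(species, sequence)].append(protein_id)
    return dict(groups)


def longest_representatives(clusters):
    """Pick the longest member of each cluster as its representative.

    clusters: {cluster_id: [(protein_id, sequence), ...]}, for example parsed
    from the output of a clustering tool (no clustering is performed here).
    Ties in length are broken by protein_id so the result is deterministic.
    """
    return {
        cluster_id: max(members, key=lambda m: (len(m[1]), m[0]))[0]
        for cluster_id, members in clusters.items()
    }
```

Proteins that fall into no cluster would simply be passed through as their own representatives.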

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
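The dating idea can be illustrated with a short Python sketch. It assumes each organism's lineage is available as a root-to-leaf list of taxon names (a simplification of the NCBI taxonomy) and that domains come as (name, start) pairs; both function names are hypothetical and not part of SMART.

```python
def architecture(domains):
    """Return domain names in linear (N- to C-terminal) order.

    domains: iterable of (name, start_position) pairs for one protein.
    """
    return tuple(name for name, _start in sorted(domains, key=lambda d: d[1]))


def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if none is shared.

    lineages: list of root-to-leaf lists of taxon names, one per organism
    that contains a protein with the architecture of interest.
    """
    if not lineages:
        raise ValueError("at least one lineage is required")
    ancestor = None
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:  # lineages diverge at this depth
            break
        ancestor = taxa[0]
    return ancestor
```

For example, an architecture found in one animal and one fungus would be dated to their shared eukaryotic ancestor.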

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you which amino acids are missing, with links to the relevant literature in Pubmed.
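A check of this kind could look roughly like the following Python sketch. The data layout is an assumption made for illustration: required residues are given as a mapping from 1-based positions within the predicted domain to the set of acceptable amino acids, which is not necessarily how SMART stores them.

```python
def missing_catalytic_residues(domain_seq, required):
    """List required positions whose residue is absent or not allowed.

    domain_seq: amino acid sequence of the predicted domain.
    required: {position (1-based): set of allowed one-letter residues}.
    Returns a list of (position, observed_residue_or_None) tuples.
    """
    missing = []
    for pos in sorted(required):
        residue = domain_seq[pos - 1] if 1 <= pos <= len(domain_seq) else None
        if residue not in required[pos]:
            missing.append((pos, residue))
    return missing


def is_inactive(domain_seq, required):
    """A domain is flagged inactive if any required residue is missing."""
    return bool(missing_catalytic_residues(domain_seq, required))
```

The list of missing positions corresponds to the detail the annotation page is described as showing.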

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern, standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display the SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002, 42, 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers. You can read about the method on the TMHMM2 web site.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Selective SMART allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
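The semantics of such a query (repeated names mean multiple copies, and case is ignored) can be made concrete with a minimal Python sketch. It handles only AND-joined terms, and the function names are hypothetical; the real query language may support more than is modelled here.

```python
import re
from collections import Counter


def parse_query(query):
    """Turn 'Sh3 AND sH3 AND sh3' into a count of required names, ignoring case."""
    terms = [t.strip().lower()
             for t in re.split(r"\s+AND\s+", query.strip(), flags=re.IGNORECASE)]
    if any(not t for t in terms):
        raise ValueError("empty term in query")
    return Counter(terms)


def matches(protein_features, query):
    """True if the protein has at least the requested number of each feature.

    protein_features: list of domain or intrinsic feature names found in one
    protein, with repeats listed once per occurrence.
    """
    have = Counter(name.lower() for name in protein_features)
    need = parse_query(query)
    return all(have[name] >= count for name, count in need.items())
```

Under this reading, the example query asks for proteins carrying at least three SH3 domains.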

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels: S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
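To make the nine-level hierarchy concrete, the following Python sketch splits an identification code into its named levels and reports the deepest level two domains share. The dotted, nine-integer format and the example codes are assumptions made for illustration; the text only states that the levels are C, A, T, H, S, O, L, I and D.

```python
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence identity cutoffs (percent) described for the S, O, L and I levels.
SEQUENCE_IDENTITY_CUTOFF = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code):
    """Split a dotted nine-field code into a {level: number} mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def deepest_shared_level(code_a, code_b):
    """Return the deepest level at which two codes still agree, or None."""
    a, b = parse_cathsolid(code_a), parse_cathsolid(code_b)
    shared = None
    for level in LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared
```

Two domains whose deepest shared level is S, for instance, sit in the same homologous superfamily and the same 35% sequence identity cluster, but in different 60% clusters.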

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
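The boundary criterion used in this benchmark can be expressed as a small Python check. This is a sketch under assumptions: boundaries are represented as residue positions at which the chain is cut, predicted and reference boundaries are paired in sorted order, and a differing number of boundaries counts as disagreement. The benchmark's exact comparison procedure is not given in the text.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding manually assigned boundary.

    predicted, reference: lists of residue positions marking domain boundaries.
    """
    if len(predicted) != len(reference):
        return False
    return all(abs(p - r) <= tolerance
               for p, r in zip(sorted(predicted), sorted(reference)))
```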

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
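The decision between automatic chopping and manual review can be sketched as a simple threshold test in Python. The four thresholds are read from the criteria listed above (SSAP score at least 80, sequence identity above 80%, RMSD at most 6.0 Å, longest end extension at most 10 residues); the text ends the list with "etc.", so the real protocol applies further criteria that are not modelled here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inheritance:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # in angstroms
    longest_end_extension: int    # in residues


def passes_autochop(result):
    """True if the listed criteria for automatic chopping are all met."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)
```

A result that fails the test would not be discarded; as described above, it is passed on as supporting information for manual assignment.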

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4-5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM-HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM-HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop; and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature, and relevant data from other family classifications (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation and also in the web-based resources used to aid manual validation have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in new folds in the database, now numbering more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.001 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
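The hit-acceptance rule for harvesting sequence relatives can be written as a short Python predicate. The thresholds are read here as an E-value below 0.001 and at least 60% residue overlap, which is an interpretation of the text; how overlap is measured is also an assumption (aligned residues divided by the length of the CATH domain), and the function name is hypothetical.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.001, min_overlap=0.60):
    """True if an HMM hit passes both the E-value and the overlap threshold.

    aligned_residues: number of CATH domain residues covered by the hit.
    domain_length: total length of the CATH domain.
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap
```

The same shape of test, with the identity and overlap cutoffs swapped in, would cover the stricter annotation-transfer filter described at the end of the paragraph.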

Recent analysis of structural and functional divergence in highly populated CATHsuperfamilies (gt5 structural relatives with lt35 sequence identity) has beenundertaken using data from the DHS The 2DSEC algorithm was used to analysemultiple structural alignments of families and identify highly conserved structuralcores and secondary structure embellishments or decorations to the common coreIn some large superfamilies extensive embellishments were observed outside thecore and although these secondary structure insertions were frequently discontinuous in the protein chain they were often co-located in 3D space (16) Inmany cases manual inspection revealed that the embellishment had aggregated toform a larger structural feature that was modifying the active site of the domain orcreating new surfaces for domain or protein interactions Data collected in the DHSclearly shows a relationship between structural divergence within a superfamilysequence divergence of this superfamily amongst predicted domains in the genomesand the number of distinct functional groups that can be identified for thesuperfamily (see Figure 4)


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
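As a rough illustration of how these SSAP score bands might be read, the following Python sketch maps a score to the interpretation given in the text. The function name and the returned labels are hypothetical, and the bands are only the "generally" observed ranges quoted above, not hard rules.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the interpretation quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "probably homologous"
    if score > 70:
        # Same fold; without sequence or functional similarity this may
        # reflect convergent evolution rather than common ancestry.
        return "similar fold"
    return "no clear structural relationship"


for example_score in (100, 86.5, 74, 52):
    print(example_score, interpret_ssap(example_score))
```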

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
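The grouping rule for homologous families can be written as a small predicate. The sketch below is a simplification under stated assumptions: the identity thresholds are read as percentages (35% and 20%), and "high structural similarity" is taken to mean an SSAP score of at least 80, a cutoff borrowed from elsewhere in these notes rather than stated in this sentence. The function and parameter names are hypothetical.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           ssap_cutoff: float = 80.0) -> bool:
    """Return True if two domains meet either grouping criterion.

    Criterion 1: significant sequence similarity (>= 35% identity).
    Criterion 2: high structural similarity (assumed SSAP >= ssap_cutoff)
                 together with some sequence similarity (>= 20% identity).
    """
    if seq_identity_pct >= 35:
        return True
    return ssap_score >= ssap_cutoff and seq_identity_pct >= 20


print(same_homologous_family(40, 55))   # sequence alone is enough
print(same_homologous_family(25, 85))   # structure plus some sequence
print(same_homologous_family(10, 95))   # structure alone is not enough
```

The third case is the one that falls to the fold (T) level rather than the homologous family (H) level in this simplified reading.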

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
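Because the four levels are nested, a CATH identifier can be treated as a path through the hierarchy. The following Python sketch assumes the usual dotted four-number notation (Class.Architecture.Topology.Homologous superfamily); the helper names are hypothetical and the identifiers are used only as examples.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_id(text: str) -> CathId:
    """Split a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-level CATH id: {text!r}")
    return CathId(*(int(p) for p in parts))


def shared_levels(a: CathId, b: CathId) -> int:
    """Count how many levels two ids share, reading from Class downwards."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


first = parse_cath_id("3.40.50.200")
second = parse_cath_id("3.40.50.300")
print(first)
print(shared_levels(first, second))  # same C, A and T; different H
```

Two domains sharing three levels have the same fold but are placed in different homologous superfamilies.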


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
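The percentages quoted for the 443 structures can be checked with a few lines of arithmetic. In the sketch below the counts 199, 169 and 40 come from the text; the number of novel folds is not given explicitly and is derived here as the remainder, on the assumption that the four categories are exhaustive.

```python
total = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: whatever is left over corresponds to the novel folds.
counts["novel folds"] = total - sum(counts.values())

percentages = {name: round(100 * n / total) for name, n in counts.items()}
homologue_pct = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / total
)

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
print(f"all homologues: {homologue_pct}%")
```

The derived remainder of 35 structures rounds to the 8% of novel folds quoted above, and the two homologue categories together give the 83% figure cited later for structures with novel sequences.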


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
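Comparing "the first three EC identifiers" amounts to comparing the first three fields of the four-field EC number. A minimal Python sketch follows; the helper names are hypothetical and the EC numbers are used only as example inputs.

```python
def ec_prefix(ec_number: str, levels: int = 3) -> tuple:
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec(ec_numbers, levels: int = 3) -> bool:
    """True if every member of a family agrees on the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


family = ["3.4.21.62", "3.4.21.64"]
print(family_shares_ec(family, levels=3))  # agree on the first three fields
print(family_shares_ec(family, levels=4))  # differ in the fourth field
```

Agreement at three levels but not four corresponds to the "closely related functions" case: the same reaction type acting on a different substrate.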

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.

For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
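To see what this extrapolation means for a whole genome, the two ranges can be multiplied. The sketch below reads the figures as percentages (54 to 70% of sequences without a sequence match, of which 80 to 90% are expected to be homologous to a known family) and simply combines the extremes; it is a back-of-the-envelope bound, not a result reported in the text.

```python
# Ranges quoted in the summary, read as percentages.
no_sequence_match = (54, 70)        # % of genome sequences
homologous_by_structure = (80, 90)  # % of those sequences

# Share of the whole genome that structure alone would place in a known family.
low = no_sequence_match[0] * homologous_by_structure[0] / 100
high = no_sequence_match[1] * homologous_by_structure[1] / 100

print(f"structure-only assignments: {low:.1f}% to {high:.1f}% of the genome")
```

On this reading, structural data would add family assignments for roughly 43 to 63% of all sequences in a genome, on top of those already matched by sequence.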

JAIL

Just another interface library

Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.

Interacting proteins are difficult to crystallize and rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180000 interfaces.

JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compiling of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
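The distance criterion can be illustrated with a short, standard-library Python sketch. It is not JAIL's implementation: the cutoff is read as 4.5 Å, each chain is represented by a hypothetical dictionary mapping a residue label to a list of atom coordinates, and the "at least 5 C-alpha atoms" rule is approximated by requiring at least five interface residues (one C-alpha per residue).

```python
from math import dist  # Euclidean distance, available from Python 3.8

CUTOFF = 4.5       # Ångström, as read from the text
MIN_RESIDUES = 5   # stands in for "at least 5 C-alpha atoms"


def interface_residues(chain, partner, cutoff=CUTOFF):
    """Residues of `chain` with at least one atom within `cutoff` of `partner`.

    Both arguments map a residue label to a list of (x, y, z) atom coordinates.
    """
    partner_atoms = [atom for atoms in partner.values() for atom in atoms]
    return {
        residue
        for residue, atoms in chain.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in partner_atoms)
    }


def is_valid_side(residues, min_residues=MIN_RESIDUES):
    """One side of an interface needs enough residues to be counted."""
    return len(residues) >= min_residues


# Tiny made-up coordinates, for illustration only.
chain_a = {"A1": [(0.0, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(3.0, 0.0, 0.0)], "B2": [(30.0, 0.0, 0.0)]}

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(side_a, side_b, is_valid_side(side_a))
```

The all-against-all atom loop is quadratic; for real structures a spatial index such as a grid or k-d tree would normally replace it.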

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
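The effect of the identity setting can be illustrated with a toy greedy filter. This sketch is not CD-HIT and does not reproduce its algorithm or options; it only shows the general idea of keeping a sequence when it is not too similar to any representative already kept. The similarity measure (`difflib.SequenceMatcher.ratio`) is a stand-in for alignment-based sequence identity, and the sequences are made up.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity, in the range 0 to 1."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def nonredundant(sequences, max_identity: float):
    """Greedy filter: longest sequences first, keep one only if it is at most
    `max_identity` similar to every representative kept so far."""
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(similarity(seq, rep) <= max_identity for rep in representatives):
            representatives.append(seq)
    return representatives


# Made-up toy sequences: the first two differ in a single position.
toy = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]

strict = nonredundant(toy, max_identity=0.50)
loose = nonredundant(toy, max_identity=0.95)
print(len(strict), len(loose))
```

The lower cutoff keeps fewer, more dissimilar representatives, which is the sense in which a 50% setting gives a more diverse set than 95%.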

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

None

Full-text search: by keyword

SCOP text search: by keyword

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
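Besides the web page, Entrez databases can be queried programmatically through NCBI's E-utilities. The sketch below only builds an `esearch` request URL for the `structure` database and does not contact the server; the search term is an arbitrary example, the helper name is hypothetical, and this route is an addition for illustration rather than something described in the notes.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term: str, max_results: int = 5) -> str:
    """Build an E-utilities esearch URL against the Entrez structure database."""
    query = urlencode({
        "db": "structure",      # Entrez database holding MMDB records
        "term": term,           # free-text query, e.g. a PDB code
        "retmode": "json",      # ask for a JSON response
        "retmax": max_results,  # cap the number of returned ids
    })
    return f"{EUTILS_ESEARCH}?{query}"


url = structure_search_url("1GOT")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return matching record ids; NCBI's usage guidelines on request rates apply to any automated use.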


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
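Several of the per-genome statistics listed above are simple counts and ratios over domain assignments. The Python sketch below shows how a few of them could be computed; it is an illustration only, since the input layout, the names and the exact definitions (for example, coverage as assigned residues over total residues, with non-overlapping domains) are assumptions rather than SUPERFAMILY's documented method.

```python
def genome_statistics(sequence_lengths, assignments):
    """Summarise domain assignments for one genome.

    sequence_lengths: dict mapping sequence id -> length in residues
    assignments: list of (sequence id, superfamily id, start, end) tuples,
                 1-based inclusive coordinates, assumed non-overlapping;
                 at least one assignment is expected.
    """
    assigned_sequences = {seq_id for seq_id, _, _, _ in assignments}
    superfamilies = {sf for _, sf, _, _ in assignments}
    covered = sum(end - start + 1 for _, _, start, end in assignments)
    total_residues = sum(sequence_lengths.values())
    return {
        "sequences": len(sequence_lengths),
        "sequences_with_assignment": len(assigned_sequences),
        "percent_with_assignment":
            100 * len(assigned_sequences) / len(sequence_lengths),
        "percent_sequence_coverage": 100 * covered / total_residues,
        "domains_assigned": len(assignments),
        "superfamilies_assigned": len(superfamilies),
        "average_superfamily_size": len(assignments) / len(superfamilies),
    }


# Made-up genome with four proteins and three domain assignments.
lengths = {"p1": 200, "p2": 100, "p3": 100, "p4": 100}
domains = [
    ("p1", "sfA", 1, 100),
    ("p1", "sfB", 101, 150),
    ("p2", "sfA", 11, 60),
]
stats = genome_statistics(lengths, domains)
print(stats)
```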



The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Chicken Poultry gene mapping project

ChickMap Chicken genome project

Chicken ArkDB Chicken database

ChickEST Chick EST database

Poultry Poultry genome project

Cotton

Cotton Cotton data collection site

Cyano Bacteria (Blue green algae)

Cyano Bacteria Anabaena genome

Daphnia (Crustacea)

Daphnia pulex Daphnia genomics consortium

Deer

Deer ArkDB Deer mapping database

Dictyostelium discoideum

Dicty_cDB Dictyostelium discoideum cDNA project

DGP Dictyostelium discoideum genome project

Dictybase Online informatics resources for Dictyostelium

Dog (Canis familiaris)

Dog Dog genome project

Dog genome project

Frog (Xenopus)

Xenbase A Xenopus web resource

Xenopus Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster)

ENSEMBL Drosophila Genome Browser at ENSEMBL

Fruitfly Drosophila genome project at Berkeley

FlyBase A Database of the Drosophila Genome

FlyMove A Drosophila multimedia database

FlyView A Drosophila image database

Fungus

Aspergillus Aspergillus Genomics

Candida Candida albicans information page

FungalWeb Fungi database

FGSC Fungal genetic stocks center

Goat (Capra hircus)

Goat GoatMap mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics

Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey

Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain often also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.

The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD Curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search and Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.

CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its "Links" menu, and select "Domain Relatives" to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
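The idea of a domain architecture as "the sequential order of conserved domains" can be made concrete with a small sketch. This is only an illustration of grouping proteins by architecture, not CDART's actual algorithm or scoring; the protein names, domain names, and start positions below are hypothetical.

```python
from collections import defaultdict

def architecture(domain_hits):
    """Return domain names in N- to C-terminal order.

    domain_hits is a list of (domain_name, start_position) pairs.
    """
    ordered = sorted(domain_hits, key=lambda hit: hit[1])
    return tuple(name for name, _start in ordered)

def group_by_architecture(proteins):
    """Group protein identifiers that share the same domain architecture."""
    groups = defaultdict(list)
    for protein_id, hits in proteins.items():
        groups[architecture(hits)].append(protein_id)
    return dict(groups)

# Hypothetical example data: (domain name, start position on the protein)
proteins = {
    "protA": [("SH2", 120), ("SH3", 40), ("Kinase", 260)],
    "protB": [("SH3", 15), ("SH2", 90), ("Kinase", 210)],
    "protC": [("Kinase", 30)],
}
groups = group_by_architecture(proteins)
```

Here protA and protB fall into the same group because their domains occur in the same order, even though the positions differ.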

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries.

CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the "Database" Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.

Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches

View a Pfam family: View Pfam family annotation and alignments

View a clan: See groups of related families

View a sequence: Look at the domain organisation of a protein sequence

View a structure: Find the domains on a PDB structure

Keyword search: Query Pfam by keywords

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can use the links on the Pfam site to find lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
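The role of the two GA cutoffs can be sketched as follows. This is a simplified illustration, not Pfam's actual pipeline: it assumes a sequence is accepted when its sequence score reaches the sequence cutoff, and that only domains reaching the domain cutoff are kept. When no sequence score is supplied, the HMMER2-style sum of domain scores is used, which the glossary notes is only approximately true for HMMER3. All scores are made up.

```python
def domains_passing_ga(domain_scores, seq_cutoff, dom_cutoff, sequence_score=None):
    """Return the domain bit scores that would enter the full alignment.

    Assumption (illustrative only): the sequence must reach seq_cutoff,
    and each individual domain must reach dom_cutoff.
    """
    if sequence_score is None:
        # HMMER2-style simplification: sequence score = sum of domain scores
        sequence_score = sum(domain_scores)
    if sequence_score < seq_cutoff:
        return []
    return [score for score in domain_scores if score >= dom_cutoff]

# Hypothetical bit scores for three domain matches on one sequence
accepted = domains_passing_ga([30.0, 12.5, 4.0], seq_cutoff=25.0, dom_cutoff=10.0)
```

With these numbers the sequence passes, and the two domains scoring 30.0 and 12.5 are kept while the 4.0 match is dropped.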

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with 10 being the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
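The relationship between the gathering threshold, the trusted cutoff (TC) and the noise cutoff (NC) defined in this glossary can be illustrated with a short sketch. It assumes a single list of bit scores and a single threshold, whereas Pfam keeps separate sequence and domain values; the scores are made up.

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Return (TC, NC) for a list of bit scores and one gathering threshold.

    TC: lowest score among matches in the full alignment (score >= GA).
    NC: highest score among matches left out of it (score < GA).
    Either value is None when the corresponding group is empty.
    """
    included = [s for s in bit_scores if s >= gathering_threshold]
    excluded = [s for s in bit_scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc

# Hypothetical search results for one family
tc, nc = trusted_and_noise_cutoffs([88.1, 41.7, 27.3, 24.9, 10.2], gathering_threshold=25.0)
```

By construction the gathering threshold always lies between NC and TC.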

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used on the site to easily identify which mode (Normal or Genomic) is active.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
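The definition above is a base-2 log-likelihood ratio, which can be written out directly. The sketch below is only an illustration of that formula; the probabilities are made-up values, and the log-space variant is included as an assumption about how one would avoid underflow for long sequences, not as a description of HMMer or BLAST internals.

```python
import math

def bits_score(p_homologue, p_random):
    """Bits score = log2 of the likelihood ratio (homologue model vs. random model)."""
    return math.log2(p_homologue / p_random)

def bits_score_from_logs(ln_p_homologue, ln_p_random):
    """Same quantity computed from natural-log likelihoods."""
    return (ln_p_homologue - ln_p_random) / math.log(2)

# Hypothetical likelihoods: the homologue model is 2**10 times more likely
score = bits_score(2.0 ** -20, 2.0 ** -30)
```

A score of 10 bits therefore means the sequence is 1024 times more likely under the homologue model than under the random model.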

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query, in the same order. (Additional domains are allowed.)
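The difference between the two definitions can be shown with a small sketch: composition ignores order, while organisation requires the query's domains to appear in the same order, with extra domains allowed in between. This is an illustration of the definitions only, not SMART's query implementation, and the domain lists are hypothetical.

```python
def same_composition(query, target):
    """True if target contains at least one copy of every domain in query."""
    return set(query) <= set(target)

def same_organisation(query, target):
    """True if the query domains occur in target in the same order.

    Additional domains in target are allowed (subsequence test).
    """
    remaining = iter(target)
    # 'domain in remaining' advances the iterator, so order is enforced
    return all(domain in remaining for domain in query)

query = ["SH3", "SH2"]
in_order = ["PH", "SH3", "SH2", "Kinase"]   # hypothetical architectures
reversed_order = ["SH2", "SH3"]
```

Both targets share the query's composition, but only the first one shares its organisation.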

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
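How a score and a database size combine into an expected count can be illustrated with the standard BLAST-style relation E = m * n * 2^(-S), where S is the bit score, m the query length and n the total database length. This is offered only as a familiar example of the concept; SMART derives its E-values from hidden Markov models, so this is not SMART's calculation, and the numbers are made up.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """BLAST-style E-value: expected number of chance alignments scoring >= bit_score."""
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)

# Hypothetical search: same score, databases of different size
small_db = expected_chance_hits(10, query_length=4, database_length=256)
large_db = expected_chance_hits(10, query_length=4, database_length=2560)
```

The same score is ten times less convincing in the ten-times-larger database, which is why E-values depend on database size while bit scores do not.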

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
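The rule described above (take the most probable amino acid at each position, and capitalise it when its probability exceeds 0.5) can be sketched as follows. The per-position probability tables are hypothetical and stand in for the emission probabilities of a real model; this is not HMMer's own code.

```python
def hmm_consensus(position_probs, threshold=0.5):
    """Build a one-line consensus from per-position amino acid probabilities.

    position_probs is a list of dicts mapping residue letter -> probability.
    """
    letters = []
    for probs in position_probs:
        residue, probability = max(probs.items(), key=lambda item: item[1])
        if probability > threshold:
            letters.append(residue.upper())   # highly conserved position
        else:
            letters.append(residue.lower())   # most likely, but weakly conserved
    return "".join(letters)

# Hypothetical three-position model
model = [
    {"A": 0.9, "G": 0.1},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.6, "R": 0.4},
]
consensus = hmm_consensus(model)
```

The middle position is lowercase because no single residue reaches a probability above 0.5 there.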

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
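What "no identical pairs of sequences" means in practice can be shown with a minimal sketch that drops exact duplicates while keeping fragments of the same protein. This is an illustration only, assuming case-insensitive exact matching; it is not the procedure used to build any particular NRDB, and the records are made up.

```python
def make_non_redundant(records):
    """Keep the first record for every distinct sequence string.

    records is a list of (identifier, sequence) pairs.
    """
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique

# Hypothetical records: 'b' duplicates 'a'; 'c' is a fragment of 'a'
records = [("a", "MKTAYIAK"), ("b", "MKTAYIAK"), ("c", "MKTAY")]
nrdb = make_non_redundant(records)
```

The fragment survives because it is not identical to the full-length entry, which is exactly the residual redundancy the glossary entry describes.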

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB: protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven National Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 this set was extended with prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
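The procedure above amounts to counting, for each neighbouring paper, how many of the hand-selected papers it was retrieved for, and keeping those counted more than twice. The sketch below illustrates that counting step only. The `get_neighbours` callable is a hypothetical stand-in for the Medline neighbour lookup, and the toy identifiers are made up.

```python
from collections import Counter

def secondary_literature(primary_papers, get_neighbours, n_neighbours=100, min_sources=3):
    """Return neighbours retrieved for more than two primary papers.

    get_neighbours(paper, limit) is a hypothetical lookup returning up to
    'limit' neighbouring paper identifiers for one primary paper.
    """
    counts = Counter()
    for paper in primary_papers:
        # a set ensures each neighbour is counted once per primary paper
        counts.update(set(get_neighbours(paper, n_neighbours)))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)

# Toy neighbour table standing in for Medline
toy_neighbours = {
    "p1": ["n1", "n2"],
    "p2": ["n1", "n3"],
    "p3": ["n1", "n2"],
}

def toy_lookup(paper, limit):
    return toy_neighbours[paper][:limit]

selected = secondary_literature(["p1", "p2", "p3"], toy_lookup)
```

Only "n1" is kept, because it is a neighbour of three primary papers; "n2" is a neighbour of just two and is therefore excluded.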

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either UniProt or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported, hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
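The redundancy-reduction procedure for the Normal mode database can be sketched roughly as follows. This is an illustration under stated assumptions, not SMART's code: the function names are hypothetical, identical sequences are collapsed within each species (the notes do not say whether this happens before or after the species split), and the clustering step is passed in as a function so that a toy grouping can stand in for CD-HIT at 96% identity.

```python
from collections import defaultdict

def pick_representatives(proteins, cluster_fn):
    """proteins: iterable of (protein_id, species, sequence).
    cluster_fn: takes a list of (protein_id, sequence) and returns clusters
    (lists of protein_ids, singletons included); a stand-in for CD-HIT."""
    by_species = defaultdict(dict)  # species -> sequence -> list of IDs
    for pid, species, seq in proteins:
        by_species[species].setdefault(seq, []).append(pid)

    reps = {}
    for species, seq_to_ids in by_species.items():
        # Keep one copy per identical sequence; the first ID stands for the rest.
        unique = {ids[0]: seq for seq, ids in seq_to_ids.items()}
        for cluster in cluster_fn(list(unique.items())):
            # The longest member represents its cluster.
            best = max(cluster, key=lambda pid: len(unique[pid]))
            reps[best] = species
    return reps

def toy_cluster(entries):
    """Toy stand-in for CD-HIT: group sequences sharing their first 4 residues."""
    groups = defaultdict(list)
    for pid, seq in entries:
        groups[seq[:4]].append(pid)
    return list(groups.values())

# Hypothetical proteins: P1 and P2 are identical, P3 is a longer relative.
proteins = [
    ("P1", "human", "MKTAYIAK"),
    ("P2", "human", "MKTAYIAK"),
    ("P3", "human", "MKTAYIAKQR"),
    ("P4", "yeast", "MSSLL"),
]
print(pick_representatives(proteins, toy_cluster))
```

Only the returned representatives (cluster representatives plus single proteins) would then be used for architecture queries and domain counts.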

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
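The last-common-ancestor step can be illustrated with a short Python sketch. It assumes each organism is given as a root-to-leaf list of taxon names; the lineages below are simplified for illustration and are not real NCBI taxonomy records, and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """lineages: list of root-to-leaf taxon name lists.
    Returns the deepest taxon shared by all lineages, or None."""
    if not lineages:
        return None
    ancestor = None
    # Walk down all lineages in parallel until they diverge.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor

# Simplified, illustrative lineages of organisms sharing one architecture.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"]

print(last_common_ancestor([human, fly]))
print(last_common_ancestor([human, fly, yeast]))
```

With only human and fly the architecture would be dated to Metazoa; adding a yeast protein with the same architecture pushes the predicted origin back to Eukaryota.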

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.

The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display the SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
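The behaviour described for selective SMART queries (repeated terms ask for multiple copies, and names are case insensitive) can be mimicked with a small Python sketch. This is not SMART's query engine: the function name is hypothetical, only the AND operator is handled, and a protein is represented simply as a list of its domain and feature names.

```python
from collections import Counter

def matches(query, features):
    """True if `features` contains every term of an AND-only query,
    with repeated terms requiring that many copies (case insensitive)."""
    required = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in features)
    return all(present[name] >= copies for name, copies in required.items())

# Hypothetical proteins described by their domain/feature lists.
three_sh3 = ["SH3", "SH3", "SH2", "SH3"]
two_sh3 = ["SH3", "SH2", "SH3"]
receptor = ["SIGNAL", "TyrKc", "TRANS"]

print(matches("Sh3 AND sH3 AND sh3", three_sh3))
print(matches("Sh3 AND sH3 AND sh3", two_sh3))
print(matches("SIGNAL AND TyrKc", receptor))
```

Counting the terms rather than testing set membership is what lets a query such as Sh3 AND sH3 AND sh3 select only proteins with at least three SH3 domains.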

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The numbers of new structures being deposited in the Protein Data Bank (PDB) continue to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
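To make the nine-level code concrete, here is a minimal Python sketch that splits a CATHsolid identifier into its named levels and checks whether two domains fall in the same cluster at a given level. The function names are hypothetical, the code assumes the usual dot-separated numeric notation, and the example identifiers are invented for illustration rather than taken from CATH.

```python
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

def parse_cathsolid(code):
    """Split a dot-separated nine-level code into a {level: number} dict."""
    parts = code.split(".")
    if len(parts) != len(LEVELS) or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected nine dot-separated integers, got {code!r}")
    return dict(zip(LEVELS, map(int, parts)))

def same_cluster(code_a, code_b, level):
    """True if both domains share every level down to and including `level`."""
    depth = LEVELS.index(level) + 1
    return code_a.split(".")[:depth] == code_b.split(".")[:depth]

# Invented identifiers: same superfamily and S (35%) cluster, different O (60%) cluster.
dom1 = "3.40.50.720.1.2.1.1.1"
dom2 = "3.40.50.720.1.3.1.1.1"

print(parse_cathsolid(dom1))
print(same_cluster(dom1, dom2, "S"))
print(same_cluster(dom1, dom2, "O"))
```

Because each level nests inside the one before it, comparing code prefixes is enough to tell at which sequence-identity level two domains of one superfamily stop clustering together.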

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
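The AutoChop decision can be pictured as a simple threshold check on each inherited result. The sketch below is only an illustration: the class and field names are hypothetical, the thresholds are the four listed in the text (the "etc." there means the real protocol applies further criteria that are not modelled here), and the example values are invented.

```python
from dataclasses import dataclass

@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    seq_identity: float           # percent
    rmsd: float                   # in Angstrom
    longest_end_extension: int    # residues

def can_autochop(result):
    """Apply only the four criteria named in the text; others are omitted."""
    return (
        result.ssap_score >= 80
        and result.seq_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )

# Invented example results for two candidate template chains.
close_relative = Inheritance(ssap_score=92.5, seq_identity=97.0, rmsd=1.2, longest_end_extension=3)
weaker_match = Inheritance(ssap_score=85.0, seq_identity=62.0, rmsd=3.5, longest_end_extension=4)

print(can_autochop(close_relative))
print(can_autochop(weaker_match))
```

A result that fails any one check, like the second example, would not be chopped automatically but passed on as supporting information for manual boundary assignment.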

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues in length.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated Data on structural similarity andsuperfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14) The DHS also providesfunctional annotations of domains within each H-level (superfamily) in CATH v 251

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
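The hit-selection rule described above can be sketched in a few lines of Python. This is only an illustration, assuming the thresholds read as an E-value below 0.01 and at least 60% of the CATH domain covered by the alignment; the HmmHit record, its field names and the example identifiers are hypothetical and are not part of any CATH software.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.01   # assumed reading of the E-value threshold in the text
MIN_OVERLAP = 0.60      # assumed reading of the 60% residue overlap

@dataclass
class HmmHit:
    """Hypothetical record for one sequence-vs-domain HMM match."""
    sequence_id: str        # e.g. a UniProt accession (illustrative only)
    e_value: float
    aligned_residues: int   # domain residues covered by the alignment
    domain_length: int      # total residues in the CATH domain

def is_homologous(hit: HmmHit) -> bool:
    """Apply both criteria: significant E-value and sufficient overlap."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < E_VALUE_CUTOFF and overlap >= MIN_OVERLAP

def harvest(hits):
    """Return the identifiers of all hits accepted as homologues."""
    return [h.sequence_id for h in hits if is_homologous(h)]
```

A hit with a very small E-value is still rejected if it covers less than 60% of the domain, which is what keeps short fragment matches out of the superfamily.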

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core and, although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
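As a rough guide to reading SSAP scores, the bands mentioned above can be written as a small lookup. This is a sketch only: the band labels and the function name are illustrative, and the text itself says the cut-offs hold "generally", so they should not be treated as hard classification rules.

```python
def ssap_band(score: float) -> str:
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # Same fold; without sequence or functional similarity this may be
        # convergent evolution rather than common ancestry.
        return "similar fold"
    return "no clear structural relationship"
```

For instance, a pair scoring 75 falls in the fold-level band, where shared ancestry still needs support from sequence or functional evidence.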

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
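The grouping rule for homologous families can be expressed as a simple predicate. This is a minimal sketch, reading the thresholds as ≥35% and ≥20% sequence identity. The text does not give a number for "high structural similarity" at this point, so the SSAP cut-off of 80 used here is an assumption borrowed from the criterion for clear homologues quoted elsewhere in these notes; the function and parameter names are illustrative.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           ssap_cutoff: float = 80.0) -> bool:
    """Return True if a pair of domains meets either homology criterion.

    Criterion 1: significant sequence similarity (>= 35% identity).
    Criterion 2: high structural similarity (assumed SSAP >= ssap_cutoff)
                 together with some sequence similarity (>= 20% identity).
    """
    if seq_identity_pct >= 35:
        return True
    return ssap_score >= ssap_cutoff and seq_identity_pct >= 20
```

Making the structural cut-off a parameter keeps the assumption visible and easy to change if a different threshold is preferred.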

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14), which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
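The percentages quoted for the 2159 new domains can be checked with a few lines of arithmetic. The counts 2159, 443, 199, 169 and 40 come from the text; the number of novel folds (35) is not stated explicitly and is inferred here as the remainder, which is an assumption of this sketch.

```python
total_new_domains = 2159      # domains classified June 1997 to March 1998
new_sequence_domains = 443    # not clearly homologous by sequence alone

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Inferred values (not stated explicitly in the text).
clearly_homologous = total_new_domains - new_sequence_domains   # 1716
novel_folds = new_sequence_domains - sum(counts.values())       # 35

def pct(part: int, whole: int) -> int:
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)

print("clearly homologous:", pct(clearly_homologous, total_new_domains), "%")
for name, n in counts.items():
    print(name + ":", pct(n, new_sequence_domains), "%")
print("novel folds:", pct(novel_folds, new_sequence_domains), "%")
```

The rounded values reproduce the quoted 79%, 45%, 38%, 9% and 8%, and the two homologue categories together give the 83% of novel sequences that can be assigned through structure.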

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.

[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
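The distance criterion for protein chains can be sketched as follows. This is a minimal illustration rather than the JAIL implementation: it assumes the cutoff reads as 4.5 Ångström, takes coordinates as plain (x, y, z) tuples keyed by hypothetical residue identifiers, and uses a brute-force comparison of all atom pairs. The nucleic acid case based on phosphorus atoms is not covered.

```python
import math

CUTOFF_ANGSTROM = 4.5   # assumed reading of the distance cutoff in the text
MIN_C_ALPHA = 5         # minimum C-alpha atoms for one side of an interface

def interface_residues(residues, partner_atoms, cutoff=CUTOFF_ANGSTROM):
    """Return ids of residues with at least one atom within `cutoff` of the partner.

    residues      -- dict mapping a residue id to a list of (x, y, z) atom coordinates
    partner_atoms -- list of (x, y, z) coordinates of the interacting domain or chain
    """
    selected = []
    for res_id, atoms in residues.items():
        # The whole residue is included as soon as any one of its atoms is close enough.
        if any(math.dist(a, p) <= cutoff for a in atoms for p in partner_atoms):
            selected.append(res_id)
    return selected

def is_valid_interface_side(residue_ids, min_c_alpha=MIN_C_ALPHA):
    """Each amino-acid residue contributes one C-alpha atom, so count residues."""
    return len(residue_ids) >= min_c_alpha
```

For whole PDB entries a spatial index (for example a k-d tree) would avoid comparing every atom pair, but the selection rule stays the same.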

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a large number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)

Search in the following interface types:
Domain-Domain (SCOP)
Chain-Chain
Protein-Nucleic
Biounit-Biounit
None

A full-text keyword search and a SCOP text keyword search are also available. Searches can be restricted to interfaces having a given location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.


Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Goat

GoatMap Mapping the caprine genome

Horse (Equus caballus)

Horse ArkDB Horse mapping database

Medaka Fish

Medaka Medaka fish home page

Maize

Maize Maize genome database

Malaria (Plasmodium spp)

Malaria Malaria genetics and genomics

PlasmoDB Plasmodium falciparum genome database

Parasites Parasite databases of clustered ESTs

Parasite Genome Parasite genome databases

Mosquito

Mosquito Mosquito genome web server

Mouse (Mus musculus)

ENSEMBL Mouse genome server at ENSEMBL

Jackson Lab Mouse Resources

MRC Mouse genome center at MRC UK

MGI Mouse genome informatics at Jackson Labs

MGD Mouse genome database

MGS Mouse genome sequencing at NIH

MIT Genetic and physical maps of the mouse genome

Mouse SNP Mouse SNP database

NCI Mouse repository

NIH NIH mouse initiative

ORNL Mutant mouse database

RIKEN Mouse resources

Rodentia The whole mouse catalog

Pig (Sus scrofa)

INCO Pig trait gene mapping

Pig Pig EST database

Pig Pig gene mapping project

PiGBase Pig genome mapping

Pig ArkDB Pig Ark DB

Plants

PlantGDB Resources for plant comparative genomics


Protozoa

Protozoa Protozoan genomes

Pufferfish

Fugu Puffer fish project UK site

Fugu Fugu genome project Singapore

Fugu Puffer fish project USA

Rat (Rattus norvegicus)

MIT Genetic maps of the Rat genome

NIH Rat genomics and genetics

Rat RatMap

RGD Rat genome database

Rice (Oryza sativa)

MPSS Massively parallel signature sequencing

Rice-research Rice genome sequence database

Rice Rice genome project

Rickettsia

RicBase Rickettsia genome database

Salmon

Salmon ArkDB Salmon mapping database

Sheep (Ovis aries)

Sheep Sheep gene mapping

SheepBase Sheep gene mapping

Sheep ArkDB Sheep mapping database

Soy

Soy Soybeans database

Sorghum

Sorghum Sorghum Genomics

Tetraodon

Tetraodon Tetraodon nigroviridis genome

Tetraodon Tetraodon nigroviridis genome at Whitehead

Tilapia

HCGS Tilapia genome

Tilapia ArkDB Tilapia mapping database

Turkey


Turkey ArkDB Turkey mapping database

Viruses

HIV HIV sequence database

Herpes Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans C. elegans genome sequencing project

NemBase Resource for nematode sequence and functional data

WormAtlas Anatomy of C. elegans

WormBase The genome and biology of C. elegans

ACEDB A C. elegans database

WWW Server C. elegans web server

Yeast

SCPD The promoter database of Saccharomyces cerevisiae

SGD Saccharomyces genome database

S. pombe Schizosaccharomyces pombe genome project

TRIPLES Functional analysis of Yeast genome at Yale

Yeast Intron database Spliceosomal introns of the yeast

Zebra fish (Danio rerio)

ZFIN Zebrafish information network

ZGR Zebrafish genome resources

ZIS Zebrafish information server

Zebrafish Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD Curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

CDD

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.

CDART Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu, and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions.

Domain: A structural unit.

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Motif: A short unit found outside globular domains.

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches.

View a Pfam family: View Pfam family annotation and alignments.

View a clan: See groups of related families.

View a sequence: Look at the domain organisation of a protein sequence.

View a structure: Find the domains on a PDB structure.

Keyword search: Query Pfam by keywords.

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.


Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
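The two gathering cutoffs can be read as a simple filter over search hits. The sketch below is illustrative only: the Hit record, the example scores and the cutoff values are hypothetical, and the rule that a match is kept when its sequence score and its domain score both reach their cutoffs is an assumption about how the two thresholds combine.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One domain match of a sequence against a profile HMM (hypothetical record)."""
    seq_id: str
    sequence_score: float  # bit score of the whole sequence against the HMM
    domain_score: float    # bit score of this single domain match


def passes_gathering(hit, seq_ga, dom_ga):
    """Return True when the hit reaches both the sequence and the domain cutoff."""
    return hit.sequence_score >= seq_ga and hit.domain_score >= dom_ga


hits = [
    Hit("seqA", 45.0, 30.0),  # clears both cutoffs
    Hit("seqB", 45.0, 12.0),  # domain score too low
    Hit("seqC", 18.0, 18.0),  # sequence score too low
]
kept = [h.seq_id for h in hits if passes_gathering(h, seq_ga=25.0, dom_ga=20.0)]
print(kept)  # ['seqA']
```

Only seqA would enter the full alignment under these made-up cutoffs.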

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.

Posterior probability


HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
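A small sketch of how such a per-residue posterior-probability annotation could be decoded into numbers. It assumes the annotation is a string with one character per alignment column, where a digit gives the level directly, '*' stands for the top level 10, and any other character (for example a gap symbol) carries no value. The function names and the example string are hypothetical.

```python
def decode_posterior(pp_line):
    """Turn a posterior-probability string into integer levels (None for gaps)."""
    levels = []
    for ch in pp_line:
        if ch == "*":
            levels.append(10)        # assumed top of the scale
        elif ch in "0123456789":
            levels.append(int(ch))   # digit gives the level directly
        else:
            levels.append(None)      # column with no aligned residue
    return levels


def low_confidence_positions(pp_line, threshold=5):
    """Return 0-based columns whose level is below the threshold."""
    return [i for i, v in enumerate(decode_posterior(pp_line))
            if v is not None and v < threshold]


print(decode_posterior("69**.8"))          # [6, 9, 10, 10, None, 8]
print(low_confidence_positions("69**.3"))  # [5]
```

A heat map like the one described above could then colour each residue by its decoded level.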

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
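The trusted and noise cutoffs bracket the gathering threshold (GA) defined earlier in this glossary. As a minimal sketch, assuming a flat list of bit scores and a single gathering threshold (the function name and all numbers are invented for illustration), the two values can be derived like this:

```python
def trusted_and_noise_cutoffs(scores, ga):
    """Return (TC, NC): lowest score at or above GA and highest score below GA."""
    members = [s for s in scores if s >= ga]      # would be in the full alignment
    non_members = [s for s in scores if s < ga]   # excluded from the full alignment
    tc = min(members) if members else None
    nc = max(non_members) if non_members else None
    return tc, nc


scores = [88.1, 40.2, 27.5, 24.9, 10.3]
tc, nc = trusted_and_noise_cutoffs(scores, ga=25.0)
print(tc, nc)  # 27.5 24.9
```

By construction NC < GA <= TC whenever both values exist, which is why the three are usually quoted together.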

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.


The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used to make it easy to identify the current mode (Normal or Genomic).

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
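The definition above amounts to a one-line calculation. The following sketch only illustrates the arithmetic; the function name and the example likelihoods are made up and are not taken from HMMer or BLAST.

```python
import math


def bits_score(p_homologue_model, p_random_model):
    """Base-2 logarithm of the likelihood ratio between the two models."""
    if p_homologue_model <= 0 or p_random_model <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(p_homologue_model / p_random_model)


# A sequence that is 8 times more likely under the family model scores 3 bits.
print(bits_score(8.0, 1.0))      # 3.0
# Equal likelihoods give 0 bits; a better fit to the random model gives < 0.
print(bits_score(1e-12, 1e-12))  # 0.0
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.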

BLAST Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
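The difference between domain composition and domain organisation is easiest to see on a toy example. In the sketch below a protein is represented simply as a list of domain names in N- to C-terminal order; this representation, the function names and the example architectures are assumptions made for illustration and are not SMART's implementation.

```python
def same_composition(query, target):
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Extra domains in the target are allowed, so this is a subsequence test.
    """
    remaining = iter(target)
    # 'domain in remaining' advances the iterator past the first match,
    # which forces later query domains to be found further along.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
shuffled = ["SH2", "SH3", "PH", "TyrKc"]
extended = ["SH3", "PH", "SH2", "TyrKc"]

print(same_composition(query, shuffled), same_organisation(query, shuffled))  # True False
print(same_composition(query, extended), same_organisation(query, extended))  # True True
```

Every protein with the same organisation as the query also has the same composition, but not the other way round.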

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
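To make the dependence on database size concrete, the sketch below uses the BLAST-style relation E = m * n * 2^(-S) between a normalised bit score S, the query length m and the total database length n. This is an assumption chosen for illustration: it is the relation from BLAST statistics, not necessarily how SMART's HMM-based E-values are computed, and the function name and numbers are invented.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# The same score becomes less significant as the database grows.
small_db = expected_chance_hits(40.0, query_len=250, db_len=1_000_000)
large_db = expected_chance_hits(40.0, query_len=250, db_len=1_000_000_000)
print(small_db < large_db)  # True

# Each extra bit halves the E-value.
print(expected_chance_hits(10.0, 1, 1024))  # 1.0
print(expected_chance_hits(11.0, 1, 1024))  # 0.5
```

This is why an E-value, unlike a raw score, can only be compared between searches of databases of similar size.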

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
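The rule for building a consensus line can be sketched in a few lines. The example assumes each model position is given as a dictionary of amino acid emission probabilities; that representation, the function name and the probabilities are hypothetical, and real HMM files store this information differently.

```python
def hmm_consensus(columns, threshold=0.5):
    """Build a one-line consensus: most probable residue per position,
    upper case if its probability exceeds the threshold, lower case otherwise."""
    line = []
    for probs in columns:
        residue, p = max(probs.items(), key=lambda item: item[1])
        line.append(residue.upper() if p > threshold else residue.lower())
    return "".join(line)


columns = [
    {"A": 0.90, "G": 0.10},             # strongly conserved alanine
    {"L": 0.40, "I": 0.35, "V": 0.25},  # leucine is most likely, but weakly
    {"C": 0.51, "S": 0.49},             # just above the 0.5 threshold
]
print(hmm_consensus(columns))  # AlC
```

Lower-case letters thus flag positions where the model has a preference but no strong conservation.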

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
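The notion of non-redundancy used here is strict identity, which is why fragments of the same gene survive. A minimal sketch, assuming records are (name, sequence) pairs and that the first record of each identical group is the one kept (both assumptions, along with the names and toy sequences):

```python
def non_redundant(records):
    """Drop records whose sequence is identical to one already seen."""
    seen = set()
    kept = []
    for name, seq in records:
        key = seq.upper()  # treat upper and lower case letters as the same residue
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept


records = [
    ("P1", "MKTAYIAKQR"),
    ("P1_copy", "MKTAYIAKQR"),  # identical: removed
    ("P1_fragment", "TAYIAK"),  # fragment of the same protein: kept
]
print([name for name, _ in non_redundant(records)])  # ['P1', 'P1_fragment']
```

The surviving fragment illustrates the residual redundancy that makes exact domain counts unreliable in such a database.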

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value


This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
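P-values and E-values are closely related. If the number of chance hits scoring at least X is assumed to follow a Poisson distribution with mean E, as in BLAST statistics, the probability of seeing at least one such hit is P = 1 - exp(-E). The sketch below applies that relation; the function name is hypothetical and the Poisson assumption is stated here, not taken from SMART.

```python
import math


def p_value_from_e_value(e_value):
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two numbers are nearly identical, while for large E-values the P-value saturates towards 1.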

PDB protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic) domains, only extracellular domains, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
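
The selection rule above can be sketched as follows. This is an assumed reading, not SMART's actual code: the neighbour lists are taken as given (the Medline retrieval step is not modelled), the function name and the `min_sources` parameter are hypothetical, and excluding the hand-selected papers themselves from the result is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List


def secondary_literature(
    neighbours: Dict[str, Iterable[str]],
    min_sources: int = 3,  # "more than two original papers"
) -> List[str]:
    """Select neighbouring papers shared by enough hand-selected papers.

    `neighbours` maps each hand-selected paper ID to the IDs of its
    neighbouring papers.
    """
    counts: Counter = Counter()
    for original, related in neighbours.items():
        # Count a neighbour once per original paper; ignore self-links.
        for paper in set(related) - {original}:
            counts[paper] += 1
    primary = set(neighbours)
    # Assumption: hand-selected papers are not repeated in the secondary list.
    return sorted(
        paper
        for paper, n in counts.items()
        if n >= min_sources and paper not in primary
    )
```

For example, with three hand-selected papers whose neighbour lists all contain paper "a", only "a" is returned; a neighbour shared by just two of them is left out.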

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange;

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animals, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any cluster) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 29 million proteins), the redundancy should be minimal.
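
The bookkeeping around this redundancy procedure can be sketched in a few lines. This is a rough illustration under stated assumptions: the `Protein` record and all function names are hypothetical, and the clustering itself is not implemented here (the clusters are assumed to come from an external tool such as CD-HIT, run per species).

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Protein(NamedTuple):
    protein_id: str
    species: str
    sequence: str


def collapse_identical(proteins: List[Protein]) -> Dict[str, List[str]]:
    """Map each distinct sequence to every ID that carries it."""
    ids_by_sequence: Dict[str, List[str]] = defaultdict(list)
    for protein in proteins:
        ids_by_sequence[protein.sequence].append(protein.protein_id)
    return dict(ids_by_sequence)


def split_by_species(proteins: List[Protein]) -> Dict[str, List[Protein]]:
    """Group proteins so that clustering can run on each species separately."""
    by_species: Dict[str, List[Protein]] = defaultdict(list)
    for protein in proteins:
        by_species[protein.species].append(protein)
    return dict(by_species)


def representatives(clusters: List[List[Protein]]) -> List[Protein]:
    """Pick the longest member of each cluster; singletons represent themselves."""
    return [max(cluster, key=lambda p: len(p.sequence)) for cluster in clusters]
```

Keeping the full ID list per sequence is what lets the different IDs of an identical protein remain searchable after only one copy is stored.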

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
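
A minimal sketch of the last-common-ancestor step described above, assuming each organism's lineage is available as a list of node names running from the root to the species. The function name and the simplified lineages in the usage note are hypothetical; a real implementation would work on NCBI taxonomy identifiers.

```python
from typing import Sequence


def last_common_ancestor(lineages: Sequence[Sequence[str]]) -> str:
    """Return the deepest taxonomy node shared by all lineages.

    Each lineage is ordered from the root down to the species.
    """
    if not lineages:
        raise ValueError("at least one lineage is required")
    ancestor = None
    # Walk all lineages in parallel from the root until they diverge.
    for nodes in zip(*lineages):
        if all(node == nodes[0] for node in nodes):
            ancestor = nodes[0]
        else:
            break
    if ancestor is None:
        raise ValueError("lineages share no common root")
    return ancestor
```

With simplified lineages for human, fruit fly and yeast, an architecture found in human and fly would be dated to Metazoa, while one also found in yeast would be dated to Eukaryota.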

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show details on which amino acids are missing, with links to the relevant literature.
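
The idea behind the catalytic activity check can be illustrated with a small sketch. It is not SMART's implementation: the function names are hypothetical, and it assumes the required residues are already expressed as 1-based positions within the predicted domain sequence, whereas a real check would map positions through the domain alignment.

```python
from typing import Dict


def missing_catalytic_residues(
    domain_sequence: str, required: Dict[int, str]
) -> Dict[int, str]:
    """Return the required positions that are not satisfied.

    `required` maps a 1-based position within the predicted domain to the
    residues accepted at that position, for example {3: "DE"}.
    """
    missing: Dict[int, str] = {}
    for position, allowed in required.items():
        index = position - 1
        out_of_range = index < 0 or index >= len(domain_sequence)
        if out_of_range or domain_sequence[index] not in allowed:
            missing[position] = allowed
    return missing


def is_inactive(domain_sequence: str, required: Dict[int, str]) -> bool:
    """A domain is flagged inactive when any required residue is missing."""
    return bool(missing_catalytic_residues(domain_sequence, required))
```

Returning the unmet positions, rather than only a flag, mirrors the annotation page behaviour of listing which amino acids are missing.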

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience; Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices, or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002, 42, 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and to use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available separately, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
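
The query behaviour described here (case-insensitive names, with a repeated name asking for that many copies) can be sketched as below. This is an assumed reading for illustration only: the function name is hypothetical, only the AND operator is handled, and a protein's features are taken as a plain list of domain or feature names.

```python
import re
from collections import Counter
from typing import Iterable


def matches_and_query(query: str, features: Iterable[str]) -> bool:
    """Check an 'X AND Y AND ...' query against a protein's features.

    Names are compared case-insensitively; repeating a name in the query
    requires at least that many copies in the protein.
    """
    terms = [t for t in re.split(r"\s+AND\s+", query.strip()) if t]
    required = Counter(term.lower() for term in terms)
    present = Counter(feature.lower() for feature in features)
    return all(present[name] >= copies for name, copies in required.items())
```

Under this reading, the query Sh3 AND sH3 AND sh3 matches only proteins with at least three SH3 domains, and SIGNAL AND TyrKc matches proteins that have both a signal peptide and a TyrKc domain.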

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order and composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to come from structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Second, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
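
The decision between automatic chopping and manual curation can be sketched as a simple predicate. This is an illustration, not the CATH code: the `Inheritance` record and function name are hypothetical, and only the four criteria named in the text are encoded (the text's "etc." indicates that further, unlisted criteria exist).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inheritance:
    """Summary of boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float     # percent
    rmsd: float                  # in angstroms
    longest_end_extension: int   # in residues


def passes_autochop(result: Inheritance) -> bool:
    """Apply the four listed thresholds; unlisted criteria are not modelled."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )
```

A result that fails any threshold would not be discarded; as described above, it would be passed on as supporting information for manual boundary assignment.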

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4-5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM-HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM-HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time relative to the number of new structures being deposited, and currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also shown graphically in Figure 1.

Number of domains within a protein chain

Designating domain boundaries is integral to the construction of the CATH database. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains decreases rapidly. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.


For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
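The selection rule described above can be sketched in Python. This is a minimal illustration rather than the CATH pipeline itself: the `HmmHit` record, its field names and the way overlap is computed (aligned residues divided by domain length) are assumptions made for the example, and the thresholds (E-value below 0.01, overlap of at least 60%) follow a reading of the figures quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """Hypothetical record for one sequence-versus-HMM match."""
    sequence_id: str
    evalue: float
    aligned_residues: int  # domain residues covered by the match
    domain_length: int     # length of the CATH domain behind the HMM


def is_homologue(hit, max_evalue=0.01, min_overlap=0.60):
    """Apply both criteria: significant E-value and sufficient overlap."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


def select_homologues(hits, max_evalue=0.01, min_overlap=0.60):
    """Keep only the hits that satisfy both thresholds."""
    return [h for h in hits if is_homologue(h, max_evalue, min_overlap)]


if __name__ == "__main__":
    example_hits = [
        HmmHit("seqA", 1e-5, 80, 100),  # passes both tests
        HmmHit("seqB", 0.5, 90, 100),   # E-value too weak
        HmmHit("seqC", 1e-8, 40, 100),  # overlap too short
    ]
    print([h.sequence_id for h in select_homologues(example_hits)])
```

Requiring both conditions guards against short but highly significant local matches, which may cover only a fragment of the domain.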

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available (for example, CATH domain PDB files, sequences and DSSP files); they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
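As a rough guide, the SSAP score ranges quoted above can be turned into a small lookup. The sketch below is illustrative only: the function name and the returned labels are invented for this example, and the cut-offs are the approximate values given in the text ("generally" above 80 or above 70), not hard classification rules.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough category described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # Same fold; without sequence or functional similarity this may
        # reflect convergent evolution rather than common ancestry.
        return "similar fold"
    return "no clear structural relationship"


if __name__ == "__main__":
    for value in (100, 85, 75, 50):
        print(value, interpret_ssap(value))
```

In practice the score is combined with sequence identity and functional evidence before a relationship is accepted, so a label from this function is only a starting point.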

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller) regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
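The percentages quoted for these structures can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text (2159 new domains, 443 with new sequences, 199 clear homologues, 169 probable homologues, 40 analogues); the number of novel folds is not stated explicitly and is derived here as the remainder, which is an assumption of this example.

```python
total_new_domains = 2159     # domains classified June 1997 to March 1998
new_sequence_domains = 443   # those without a clear sequence relative

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not quoted in the text: whatever is left over is a novel fold.
counts["novel folds"] = new_sequence_domains - sum(counts.values())


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


breakdown = {name: percent(n, new_sequence_domains) for name, n in counts.items()}

# Fraction of all new domains that were clearly homologous by sequence.
sequence_homologous = percent(total_new_domains - new_sequence_domains,
                              total_new_domains)

# Fraction of new-sequence domains assignable as homologues via structure.
structure_homologous = percent(counts["clear homologues"]
                               + counts["probable homologues"],
                               new_sequence_domains)

if __name__ == "__main__":
    print(breakdown)
    print(sequence_homologous, structure_homologous)
```

Under these assumptions the remainder is 35 domains, which rounds to the 8% novel folds reported, and the combined clear and probable homologues account for roughly 83% of the new-sequence structures.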


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
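The comparison of "the first three EC identifiers" can be illustrated with a short helper. This is a hedged sketch: the function names are invented for the example, EC numbers are assumed to be written in the usual dotted four-level form, and the values shown are illustrative rather than taken from the CATH data.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of a dotted EC number as a tuple."""
    return tuple(ec_number.strip().split(".")[:levels])


def share_ec_levels(ec_numbers, levels=3):
    """True if every EC number in the family agrees at the first `levels` fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1


if __name__ == "__main__":
    family = ["3.4.21.62", "3.4.21.4"]   # illustrative EC numbers
    print(share_ec_levels(family))       # same class, subclass and sub-subclass
    print(share_ec_levels(family, 4))    # differ at the fourth (substrate) level
```

Comparing three levels rather than four treats enzymes that catalyse the same type of reaction on different substrates as functionally related, which matches the "single or closely related functions" wording used in the text.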

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways.

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Å of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
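The distance criterion above can be sketched as a brute-force search over atom pairs. This is a minimal illustration and not JAIL's own code: the `Atom` record and function names are invented for the example, the cut-off is taken as 4.5 Å, and the minimum-size rule is interpreted as "at least five residues", since each amino acid contributes one C-alpha atom.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: residue number plus Cartesian coordinates.
Atom = namedtuple("Atom", "residue_id x y z")


def distance(a, b):
    """Euclidean distance between two atoms, in the units of the coordinates."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def interface_residues(chain_a, chain_b, cutoff=4.5):
    """Residues of chain_a with at least one atom within `cutoff` of chain_b."""
    found = set()
    for atom in chain_a:
        if atom.residue_id in found:
            continue  # residue already counted as a whole
        if any(distance(atom, other) <= cutoff for other in chain_b):
            found.add(atom.residue_id)
    return found


def is_valid_interface_part(residues, min_residues=5):
    """Size filter: one C-alpha per residue, so count residues."""
    return len(residues) >= min_residues


if __name__ == "__main__":
    chain_a = [Atom(1, 0.0, 0.0, 0.0), Atom(2, 10.0, 0.0, 0.0)]
    chain_b = [Atom(101, 3.0, 0.0, 0.0)]
    print(interface_residues(chain_a, chain_b))
```

Comparing every atom pair is quadratic in the number of atoms; for whole-database calculations a spatial index such as a grid or k-d tree would be the usual replacement, with the same inclusion rule.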

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation in Jmol"?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id (e.g. 1aay)

SCOP-Id (e.g. d1az0a_)

EC-Number (e.g. 311)

Accession number (e.g. P03697)

Protein name (e.g. capsid protein)

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None

Full-text search: keyword

SCOP text search: keyword

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.


Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

Recent news

3rd June 2013 genetrainer being launched at LeWeb conference

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Protozoa
Protozoa: Protozoan genomes

Pufferfish
Fugu: Puffer fish project, UK site
Fugu: Fugu genome project, Singapore
Fugu: Puffer fish project, USA

Rat (Rattus norvegicus)
MIT: Genetic maps of the rat genome
NIH: Rat genomics and genetics
Rat: RatMap
RGD: Rat genome database

Rice (Oryza sativa)
MPSS: Massively parallel signature sequencing
Rice-research: Rice genome sequence database
Rice: Rice genome project

Rickettsia
RicBase: Rickettsia genome database

Salmon
Salmon ArkDB: Salmon mapping database

Sheep (Ovis aries)
Sheep: Sheep gene mapping
SheepBase: Sheep gene mapping
Sheep ArkDB: Sheep mapping database

Soy
Soy: Soybeans database

Sorghum
Sorghum: Sorghum Genomics

Tetraodon
Tetraodon: Tetraodon nigroviridis genome
Tetraodon: Tetraodon nigroviridis genome at Whitehead

Tilapia
HCGS: Tilapia genome
Tilapia ArkDB: Tilapia mapping database

Turkey


Turkey ArkDB: Turkey mapping database

Viruses
HIV: HIV sequence database
Herpes: Human herpes virus 5 database

Worm (Caenorhabditis elegans)
C. elegans: C. elegans genome sequencing project
NemBase: Resource for nematode sequence and functional data
WormAtlas: Anatomy of C. elegans
WormBase: The genome and biology of C. elegans
ACEDB: A C. elegans database
WWW Server: C. elegans web server

Yeast
SCPD: The promoter database of Saccharomyces cerevisiae
SGD: Saccharomyces genome database
S. pombe: Schizosaccharomyces pombe genome project
TRIPLES: Functional analysis of the yeast genome at Yale
Yeast Intron database: Spliceosomal introns of the yeast

Zebrafish (Danio rerio)
ZFIN: Zebrafish information network
ZGR: Zebrafish genome resources
ZIS: Zebrafish information server
Zebrafish: Zebrafish webserver

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found is an independently folding unit of a polypeptide chain carrying a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

CDD

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
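The idea of scanning a query with a position-specific scoring matrix can be sketched in a few lines of Python. This is only a toy, ungapped illustration and not RPS-BLAST itself: the matrix values, the three-residue alphabet, the penalty for unknown residues and the function name `best_pssm_hit` are all invented for the example.

```python
# Toy ungapped PSSM scan; the real RPS-BLAST search is gapped and heuristic.
from typing import Dict, List, Tuple

# Hypothetical 3-column PSSM: one dict of residue -> score per model position.
PSSM: List[Dict[str, int]] = [
    {"A": 4, "C": -2, "G": -1},
    {"A": -1, "C": 5, "G": -3},
    {"A": -2, "C": -1, "G": 6},
]

def best_pssm_hit(query: str, pssm: List[Dict[str, int]]) -> Tuple[int, int]:
    """Return (best_score, start) over all ungapped windows of the query."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues missing from a column get a fixed penalty (an assumption).
        score = sum(col.get(res, -4) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    if best is None:
        raise ValueError("query is shorter than the PSSM")
    return best

print(best_pssm_hit("GGACGA", PSSM))
```

Each column of the matrix scores every residue differently, which is what distinguishes a PSSM from a single substitution matrix applied uniformly along the sequence.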

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.

CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein


queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu, and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
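Treating a domain architecture as the ordered list of domains on a protein makes grouping straightforward. The Python sketch below shows only that grouping step; CDART's actual scoring is not described here, and the protein IDs, domain names and the function `group_by_architecture` are hypothetical.

```python
# Group proteins by domain architecture (ordered tuple of domain names).
from collections import defaultdict
from typing import Dict, List, Tuple

Architecture = Tuple[str, ...]

def group_by_architecture(proteins: Dict[str, List[str]]) -> Dict[Architecture, List[str]]:
    """Map each distinct ordered domain list to the proteins that carry it."""
    groups: Dict[Architecture, List[str]] = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)

# Hypothetical annotations: protein ID -> domains in N- to C-terminal order.
annotations = {
    "P1": ["SH3", "SH2", "Kinase"],
    "P2": ["SH3", "SH2", "Kinase"],
    "P3": ["SH2", "SH3", "Kinase"],  # same domains, different order
}
groups = group_by_architecture(annotations)
print(groups[("SH3", "SH2", "Kinase")])
```

Because order matters, P3 ends up in its own group even though it contains the same three domains as P1 and P2.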

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI


Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (Mar 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam 27.0 (March 2013, 14831 families)

Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.


YOU CAN FIND DATA IN PFAM IN VARIOUS WAYS

Analyze your protein sequence for Pfam matches

View Pfam family annotation and alignments


See groups of related families

Look at the domain organisation of a protein sequence

Find the domains on a PDB structure

Query Pfam by keywords

Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can find lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
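One way to picture the two gathering cutoffs is as a pair of filters applied to each hit. The Python sketch below assumes that a sequence must reach the sequence cutoff before any of its domains are considered, and that each domain must then reach the domain cutoff; the `Hit` class, the function name and the example scores are hypothetical and are not taken from the Pfam pipeline.

```python
# Sketch of applying a sequence GA and a domain GA to one HMM hit.
from dataclasses import dataclass
from typing import List

@dataclass
class Hit:
    sequence_id: str
    sequence_score: float       # bit score for the whole sequence
    domain_scores: List[float]  # bit score of each individual domain match

def domains_passing_ga(hit: Hit, seq_ga: float, dom_ga: float) -> List[float]:
    """Return the domain scores kept when both cutoffs are applied."""
    if hit.sequence_score < seq_ga:
        return []  # the sequence as a whole fails, so no domain is kept
    return [score for score in hit.domain_scores if score >= dom_ga]

strong = Hit("Q1", 45.0, [30.0, 15.0])
weak = Hit("Q2", 20.0, [20.0])
print(domains_passing_ga(strong, seq_ga=25.0, dom_ga=20.0))
print(domains_passing_ga(weak, seq_ga=25.0, dom_ga=20.0))
```

With these made-up numbers, only the first domain of Q1 survives, and Q2 is rejected outright because its sequence score is below the sequence cutoff.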

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit score of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability


HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest scoring match in the full alignment
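The trusted and noise cutoffs defined in this glossary can be read directly off a list of match scores once the gathering threshold is known. The Python sketch below simplifies matters by using a single threshold rather than separate sequence and domain cutoffs; the function name and the scores are invented for illustration.

```python
# Derive trusted cutoff (TC) and noise cutoff (NC) from scores and one GA value.
from typing import List, Optional, Tuple

def trusted_and_noise_cutoffs(
    scores: List[float], ga: float
) -> Tuple[Optional[float], Optional[float]]:
    """TC = lowest score at or above GA; NC = highest score below GA."""
    included = [s for s in scores if s >= ga]  # matches in the full alignment
    excluded = [s for s in scores if s < ga]   # matches left out
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc

print(trusted_and_noise_cutoffs([55.2, 31.0, 27.4, 24.9, 12.3], ga=25.0))
```

By construction the gathering threshold lies between the two values: NC < GA <= TC.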

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.


The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that in Genomic mode you are exploring a limited set of genomes.

Different color schemes are used to easily identify the mode we are in (Normal mode or Genomic mode).

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
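The definition above amounts to a one-line calculation. The Python sketch below takes the two likelihoods as given; in practice they come from the search program's models, and the function name and the example values are only illustrative.

```python
# Bits score as the base-2 logarithm of a likelihood ratio.
import math

def bits_score(p_model: float, p_random: float) -> float:
    """log2 of P(sequence | homology model) / P(sequence | random model)."""
    if p_model <= 0 or p_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(p_model / p_random)

# A sequence 1000 times more likely under the homology model
# scores just under 10 bits.
print(bits_score(1e-10, 1e-13))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.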

BLAST (Basic local alignment search tool)

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.


Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
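The difference between the two definitions is that composition ignores order while organisation preserves it. A minimal Python sketch, assuming each protein is described by its list of domain names from N- to C-terminus (the function names and domain names are hypothetical, not SMART's own code):

```python
# Domain composition (set containment) vs domain organisation (ordered subsequence).
from typing import List

def same_composition(query: List[str], target: List[str]) -> bool:
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)

def same_organisation(query: List[str], target: List[str]) -> bool:
    """True if the query domains occur in the target in the same order.

    Additional domains in the target are allowed, so this is a
    subsequence test rather than an equality test.
    """
    remaining = iter(target)
    # `domain in remaining` advances the iterator until it finds a match.
    return all(domain in remaining for domain in query)

query = ["SH3", "SH2"]
print(same_composition(query, ["SH2", "SH3"]), same_organisation(query, ["SH2", "SH3"]))
print(same_organisation(query, ["PH", "SH3", "Kinase", "SH2"]))
```

A protein with the domains in reverse order therefore matches the query's composition but not its organisation.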

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
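To see how an E-value ties a score to database size, the Python sketch below uses the relation commonly quoted for BLAST-style bit scores, E = m × n × 2^(−S), where m is the query length, n the total database length and S the bit score. This is an assumption made for illustration; SMART's HMM-based calibration is not described here and may differ. The function name is hypothetical.

```python
# E-value from a bit score under the BLAST-style relation E = m * n * 2**(-S).
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Number of chance alignments expected to score at least `bit_score`."""
    return query_len * db_len * 2.0 ** (-bit_score)

small_db = expected_hits(40.0, query_len=300, db_len=10_000_000)
large_db = expected_hits(40.0, query_len=300, db_len=20_000_000)
print(small_db, large_db)
```

The same score is less significant in a larger database: doubling the database length doubles the expected number of chance hits.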

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM (Hidden Markov model)

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
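The consensus rule can be sketched in Python as follows, reading the capitalisation threshold as a probability of 0.5 for protein models. The per-position probability tables are invented, and `hmm_consensus` is a hypothetical helper rather than a HMMer function.

```python
# Build a one-line consensus from per-position residue probabilities.
from typing import Dict, List

def hmm_consensus(match_probs: List[Dict[str, float]], threshold: float = 0.5) -> str:
    """Pick the most probable residue per position; capitalise if conserved."""
    consensus = []
    for probs in match_probs:
        residue, p = max(probs.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if p > threshold else residue.lower())
    return "".join(consensus)

# Hypothetical match-state emission probabilities for a 3-position model.
model = [
    {"A": 0.9, "G": 0.1},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.6, "R": 0.4},
]
print(hmm_consensus(model))
```

Here the middle position is shown in lower case because no residue exceeds the threshold there.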

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM), and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication.

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm.

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB (non-redundant database)

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
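The notion of non-redundancy used here (no two identical sequences, but fragments are kept) can be illustrated with a short Python sketch. The record IDs and sequences are made up, and `non_redundant` is a hypothetical helper.

```python
# Collapse exact duplicate sequences while keeping fragments as separate entries.
from typing import Dict, List

def non_redundant(records: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each unique sequence to the list of IDs that share it."""
    unique: Dict[str, List[str]] = {}
    for seq_id, seq in records.items():
        unique.setdefault(seq.upper(), []).append(seq_id)
    return unique

records = {
    "entry_a": "MKTAYIAK",
    "entry_b": "MKTAYIAK",  # exact duplicate of entry_a
    "entry_c": "MKTAY",     # fragment of the same protein, kept separately
}
print(non_redundant(records))
```

This shows why such a database can still overcount a gene: the fragment is not identical to the full-length sequence, so it survives the filter.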

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value


This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
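In BLAST statistics a P-value and an E-value are usually related by P = 1 − e^(−E), which follows from treating chance hits as Poisson-distributed. The Python sketch below assumes that relation; whether SMART reports P-values in exactly this way is not stated in these notes, and the function name is hypothetical.

```python
# P-value from an E-value, assuming Poisson-distributed chance hits.
import math

def p_value_from_e(e_value: float) -> float:
    """Probability of at least one chance hit: 1 - exp(-E)."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)

for e in (1e-6, 0.1, 1.0, 10.0):
    print(e, p_value_from_e(e))
```

For small values the two numbers are nearly identical, while for large E the P-value saturates at 1.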

PDB protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven natLabs or EBI) Domain families represented in SMART and in the PDB are annotated as being of known structure links are provided in SMART to the PDB via PDBsum and MMDB PDBsum linkscan be used to access a variety of sequence-based and structure-based tools whereas MMDBprovides access to literature information and structure similarities

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]) Pfam WWW servers allow comparison of user-supplied sequences with the Pfamdatabase (Sanger Center and Washington Univ)SMART contains a facility to search the Pfam collection using HMMer

Prokaryotic domains

SMART now also searches for domains found in two component regulatory systems These can befound mainly in Prokaryotes but a few were also found in eukaryotes like yeast and plants

Profile

A profile is a table of position-specific scores and gap penalties representing an homologousfamily that may be used to search sequence databases (Ref [1] [2] [3])In CLUSTAL-W-derived profiles those sequences that are more distantly related are assignedhigher weights ([4] [5] [6]) Issues in profile-based database searching are discussed in Bork ampGibson (1996) [7]

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against adatabase of profiles (located at the ISREC)

PROSITE

This is a dictionary of protein sites and motif patterns Some SMART domain annotations containlinks to PROSITE

Schnipsel database (domain sequence database)

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether to search only cytoplasmic (prokaryotic + eukaryotic) domains, only extracellular domains, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
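
The selection rule described above can be sketched as a simple counting step. The sketch assumes the neighbour lists have already been retrieved (the Medline lookup itself is not shown), and the paper identifiers and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def secondary_literature(neighbours: Dict[str, Iterable[str]],
                         min_sources: int = 3) -> List[str]:
    """Keep neighbouring papers reached from more than two original papers.

    `neighbours` maps each hand-selected paper to its neighbouring papers.
    "More than two" original papers is expressed as min_sources = 3.
    """
    counts: Counter = Counter()
    for related in neighbours.values():
        # set() so a neighbour listed twice for one original counts once.
        for paper in set(related):
            counts[paper] += 1
    return sorted(p for p, n in counts.items() if n >= min_sources)


toy_neighbours = {
    "P1": ["N1", "N2"],
    "P2": ["N1", "N2"],
    "P3": ["N1", "N3"],
}
selected = secondary_literature(toy_neighbours)
```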

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange;

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animals, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
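
The redundancy-reduction steps listed above can be sketched roughly as follows. The clustering itself is treated as an external step: the sketch assumes that clusters for one species have already been produced (for example by parsing the output of a clustering tool) and only shows the collapsing of identical sequences and the choice of representatives. All identifiers and function names are hypothetical.

```python
from typing import Dict, List


def collapse_identical(sequences: Dict[str, str]) -> Dict[str, List[str]]:
    """Group protein IDs by identical sequence so one copy can be kept."""
    by_sequence: Dict[str, List[str]] = {}
    for protein_id, seq in sequences.items():
        by_sequence.setdefault(seq, []).append(protein_id)
    return by_sequence


def pick_representatives(sequences: Dict[str, str],
                         clusters: List[List[str]]) -> List[str]:
    """Return the longest member of each cluster plus all unclustered proteins."""
    representatives: List[str] = []
    clustered = set()
    for members in clusters:
        clustered.update(members)
        # The longest member represents its cluster.
        representatives.append(max(members, key=lambda pid: len(sequences[pid])))
    # Single proteins (not in any cluster) are kept as they are.
    representatives.extend(pid for pid in sequences if pid not in clustered)
    return representatives


toy_sequences = {"a": "MKT", "b": "MKTA", "c": "MKTAA", "d": "GG"}
toy_clusters = [["a", "b", "c"]]
reps = pick_representatives(toy_sequences, toy_clusters)
```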

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
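
A minimal sketch of the last-common-ancestor step, assuming each organism is already represented by its lineage written from the root of the taxonomy downwards. The lineages shown are simplified toy examples, not real NCBI taxonomy records, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def last_common_ancestor(lineages: Sequence[Sequence[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        raise ValueError("at least one lineage is required")
    common: Optional[str] = None
    # zip(*) walks all lineages rank by rank and stops at the shortest one.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            common = ranks[0]
        else:
            break
    return common


toy_lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Fungi"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
]
origin = last_common_ancestor(toy_lineages)
```

In this toy case the architecture would be dated to Eukaryota, because that is the deepest node containing every organism in which the architecture occurs.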

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
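
The notion of a 1:1 reciprocal best match between two genomes can be sketched as below. The sketch assumes that similarity scores between proteins of genome A and genome B have already been computed in both directions; the score tables, protein names and function name are hypothetical, and this is not presented as SMART's actual implementation.

```python
from typing import Dict, List, Optional, Tuple

Hits = Dict[str, Dict[str, float]]  # query -> {subject: score}


def _best(hits: Dict[str, float]) -> Optional[str]:
    """Return the subject with the highest score, or None if there are no hits."""
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(a_to_b: Hits, b_to_a: Hits) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a, hits in a_to_b.items():
        b = _best(hits)
        if b is not None and _best(b_to_a.get(b, {})) == a:
            pairs.append((a, b))
    return sorted(pairs)


toy_a_to_b = {"A1": {"B1": 200.0, "B2": 50.0}, "A2": {"B1": 120.0}}
toy_b_to_a = {"B1": {"A1": 190.0, "A2": 110.0}, "B2": {"A1": 45.0}}
rbh = reciprocal_best_hits(toy_a_to_b, toy_b_to_a)
```

Here A2 also hits B1, but B1 prefers A1, so only the A1/B1 pair qualifies as a reciprocal best match.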

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure-based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser. Click here for more info.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera :-), and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
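
The behaviour described here (AND-combined terms, case-insensitive matching, repeated terms meaning multiple copies) can be illustrated with a small sketch. It covers only the AND operator and is an illustration of the query semantics, not SMART's own parser; the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable


def matches_and_query(query: str, features: Iterable[str]) -> bool:
    """True if the protein has every queried feature in at least the requested copy number."""
    terms = re.split(r"\s+AND\s+", query.strip())
    # Repeating a term (in any letter case) asks for that many copies.
    required = Counter(term.lower() for term in terms if term)
    present = Counter(feature.lower() for feature in features)
    return all(present[name] >= copies for name, copies in required.items())


three_sh3 = matches_and_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"])
two_sh3 = matches_and_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3"])
```

The same function handles intrinsic features, since names such as SIGNAL, TRANS and COIL are treated like any other term.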

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The numbers of new structures being deposited in the Protein Data Bank (PDB) continue to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels: S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
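
The three inclusion criteria quoted at the end of this paragraph can be written as a simple filter. The record layout and names below are hypothetical, and the sketch does not address how structures without a resolution value (for example NMR entries) are handled, since the text does not say.

```python
from dataclasses import dataclass


@dataclass
class StructureSummary:
    """Hypothetical per-structure summary used only for this illustration."""
    resolution_angstrom: float
    length_residues: int
    side_chains_resolved_pct: float


def meets_cath_criteria(s: StructureSummary) -> bool:
    """4 Å resolution or better, at least 40 residues, at least 70% side chains resolved."""
    return (
        s.resolution_angstrom <= 4.0          # lower value = better resolution
        and s.length_residues >= 40
        and s.side_chains_resolved_pct >= 70.0
    )


accepted = meets_cath_criteria(StructureSummary(2.1, 159, 98.0))
too_short = meets_cath_criteria(StructureSummary(1.8, 35, 100.0))
```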

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
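
One simple way to express the ±15 residue agreement used in this benchmark is sketched below. The exact scoring procedure of the benchmark is not given in the text, so this assumes that predicted and reference boundaries are compared pairwise in sequence order; the function name and example positions are hypothetical.

```python
from typing import Sequence


def boundaries_agree(predicted: Sequence[int], reference: Sequence[int],
                     tolerance: int = 15) -> bool:
    """True if both sets have the same number of boundaries and each
    predicted boundary lies within `tolerance` residues of its counterpart."""
    if len(predicted) != len(reference):
        return False
    return all(abs(p - r) <= tolerance
               for p, r in zip(sorted(predicted), sorted(reference)))


close_enough = boundaries_agree([120, 251], [112, 260])
too_far = boundaries_agree([120, 251], [112, 280])
```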

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
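
The decision between automatic chopping and manual support can be sketched as a threshold check. Only the four criteria named in the text are encoded; the original list ends with "etc.", so further conditions exist that are not shown. Taking the "best result" to be the candidate with the highest SSAP score is an assumption, as are the record layout and the chain identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InheritedChop:
    """Hypothetical summary of boundaries inherited from one template chain."""
    template_chain: str
    ssap_score: float
    sequence_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int


def passes_autochop(c: InheritedChop) -> bool:
    """Check the four criteria quoted in the text (the full list is longer)."""
    return (c.ssap_score >= 80
            and c.sequence_identity_pct > 80
            and c.rmsd_angstrom <= 6.0
            and c.longest_end_extension <= 10)


def decide(candidates: List[InheritedChop]) -> Tuple[str, Optional[InheritedChop]]:
    """Return ('autochop', best) or ('manual', best-as-supporting-evidence)."""
    if not candidates:
        return ("manual", None)
    # Assumption: the best result is the one with the highest SSAP score.
    best = max(candidates, key=lambda c: c.ssap_score)
    return ("autochop" if passes_autochop(best) else "manual", best)


good = InheritedChop("1abcA", 92.0, 95.0, 1.2, 3)
poor = InheritedChop("2xyzB", 85.0, 70.0, 2.0, 4)
```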

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time relative to the number of new structures being deposited, and currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 4157

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
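The hit-filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` record, its field names and the function names are hypothetical, and the default thresholds (E-value below 0.01, at least 60% of the domain covered) are one reading of the cutoffs quoted in the text, not the CATH pipeline's actual code.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one HMM match against a CATH domain model."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the match
    domain_length: int     # total residues in the CATH domain


def overlap_fraction(hit: Hit) -> float:
    """Fraction of the CATH domain covered by the match."""
    return hit.aligned_residues / hit.domain_length


def select_homologues(hits, max_e_value=0.01, min_overlap=0.60):
    """Keep hits that pass both the E-value and the overlap cutoff (assumed values)."""
    return [
        h for h in hits
        if h.e_value < max_e_value and overlap_fraction(h) >= min_overlap
    ]


# Made-up example hits: only the first passes both cutoffs.
example_hits = [
    Hit("seqA", 1e-5, 80, 100),
    Hit("seqB", 0.5, 90, 100),   # fails the E-value cutoff
    Hit("seqC", 1e-6, 40, 100),  # fails the overlap cutoff
]
kept = select_homologues(example_hits)
```

The same function could be reused for the annotation step by passing stricter thresholds, although that step filters on sequence identity rather than E-value, so a real implementation would need an extra field.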

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score < 80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
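The four categories used in this breakdown can be written as a simple decision rule, and the quoted percentages can be checked from the counts. The sketch below is illustrative only: the function name, argument names and the boolean simplification of "functional or profile-search evidence" are assumptions, while the thresholds (SSAP 80, 20% identity) and the counts (443, 199, 169, 40) are taken from the text.

```python
def classify_new_structure(same_fold, ssap_score, seq_identity, other_evidence):
    """Assign one of the four categories described in the text.

    same_fold      -- True if the domain resembles a previously determined fold
    ssap_score     -- SSAP structural similarity score to the best match
    seq_identity   -- percentage sequence identity to the best match
    other_evidence -- True if function or a profile search (e.g. PSI-BLAST)
                      supports a common ancestor (a simplification)
    """
    if not same_fold:
        return "novel fold"
    if ssap_score >= 80 and seq_identity >= 20:
        return "clear homologue"
    if other_evidence:
        return "probable homologue"
    return "analogue"


# Counts quoted in the text for the 443 structures with new sequences.
TOTAL = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
# The remainder are the structures that did not resemble any known fold.
counts["novel fold"] = TOTAL - sum(counts.values())

percentages = {name: round(100 * n / TOTAL) for name, n in counts.items()}
```

Under these assumptions the remainder is 35 structures, which rounds to the 8% of novel folds quoted above.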


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
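The comparison of EC identifiers described above amounts to checking whether the members of a family agree on the leading levels of their EC numbers. A minimal Python sketch follows; the function names are hypothetical, and the EC numbers in the example are illustrative inputs rather than data from the study.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family agrees at the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) <= 1


# Illustrative family annotations (not taken from the CATH analysis).
same_reaction_class = shares_ec_prefix(["3.4.21.62", "3.4.21.14"])
different_function = shares_ec_prefix(["3.4.21.62", "1.1.1.1"])
single_ec_code = shares_ec_prefix(["3.4.21.62", "3.4.21.62"], levels=4)
```

Setting `levels=4` corresponds to requiring a single EC code for the whole family, the stricter criterion used for the 95% and 98% figures.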

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

(i) The structural data allow recognition of more distant homologues compared with sequence data: in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

(ii) The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

(iii) For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

(iv) Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures are included in SCOP Particularly new PDB entries are

not yet annotated To overcome this problem additionally all interfaces between protein

chains were calculated and included in the database This type of interface also comprises

the interacting parts of the assumed biological units The last important type of interfaces

provided here is composed of the interacting parts between proteins and nucleic acids

Overall the data set consists of about 180000 interfaces

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
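The distance-based definition above can be illustrated with a short, dependency-free Python sketch. It is not the JAIL implementation: the `Atom` record and function names are hypothetical, the cutoff of 4.5 Å is the assumed reading of the value quoted in the text, and the "at least 5 C-alpha atoms" rule is approximated as "at least 5 interface residues", on the assumption that each residue contributes one C-alpha atom. The brute-force double loop is fine for small examples; a real tool would use a spatial index.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: chain ID, residue number, atom name, coordinates.
Atom = namedtuple("Atom", "chain residue_id name x y z")


def interface_residues(side_a, side_b, cutoff=4.5):
    """Residues of side_a with at least one atom within `cutoff` of any atom of side_b."""
    residues = set()
    for a in side_a:
        for b in side_b:
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff:
                # One close atom is enough: the complete residue joins the interface.
                residues.add((a.chain, a.residue_id))
                break
    return residues


def is_large_enough(residues, min_residues=5):
    """Approximate the 'at least 5 C-alpha atoms' rule by counting residues."""
    return len(residues) >= min_residues


# Toy coordinates: residue A1 is 3 Å from chain B, residue A2 is 7 Å away.
chain_a = [Atom("A", 1, "CA", 0.0, 0.0, 0.0), Atom("A", 2, "CA", 10.0, 0.0, 0.0)]
chain_b = [Atom("B", 1, "CA", 3.0, 0.0, 0.0)]

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
```

Each side of the interface is computed separately, which matches the wording that "one part of an interface" must meet the minimum size.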

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

How are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf, a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None

Full-text search and SCOP text search by keyword are also available.

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.


Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
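Several of the genome statistics listed among the major features are simple ratios over a table of domain assignments. The following is a minimal sketch, assuming a hypothetical input layout (a dictionary of sequence lengths and a list of `(sequence_id, superfamily, start, end)` tuples); it is not the schema of the SUPERFAMILY download, only an illustration of how such numbers can be derived.

```python
from collections import defaultdict


def genome_statistics(sequence_lengths, assignments):
    """Summarise domain assignments for one genome.

    sequence_lengths: {sequence_id: length in residues}   (hypothetical layout)
    assignments: list of (sequence_id, superfamily, start, end),
                 with 1-based inclusive coordinates        (hypothetical layout)
    """
    covered = defaultdict(set)   # residues covered by at least one domain
    superfamilies = set()
    for seq_id, superfamily, start, end in assignments:
        covered[seq_id].update(range(start, end + 1))
        superfamilies.add(superfamily)

    n_sequences = len(sequence_lengths)
    n_assigned = len(covered)
    total_length = sum(sequence_lengths.values())
    covered_length = sum(len(positions) for positions in covered.values())

    return {
        "sequences": n_sequences,
        "sequences_with_assignment": n_assigned,
        "percent_sequences_with_assignment": 100.0 * n_assigned / n_sequences,
        "percent_sequence_coverage": 100.0 * covered_length / total_length,
        "domains_assigned": len(assignments),
        "superfamilies_assigned": len(superfamilies),
        "average_sequence_length": total_length / n_sequences,
    }


toy_lengths = {"p1": 100, "p2": 200, "p3": 100}
toy_assignments = [
    ("p1", "sfA", 1, 50),
    ("p2", "sfA", 11, 60),
    ("p2", "sfB", 101, 200),
]
stats = genome_statistics(toy_lengths, toy_assignments)
```

Storing covered residues as a set means overlapping assignments on the same sequence are not counted twice when the coverage percentage is computed.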


Recent news

3rd June 2013 genetrainer being launched at LeWeb conference

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Turkey

ArkDB: Turkey mapping database

Viruses

HIV: HIV sequence database

Herpes: Human herpes virus 5 database

Worm (Caenorhabditis elegans)

C. elegans: C. elegans genome sequencing project

NemBase: Resource for nematode sequence and functional data

WormAtlas: Anatomy of C. elegans

WormBase: The genome and biology of C. elegans

ACEDB: A C. elegans database

WWW Server: C. elegans web server

Yeast

SCPD: The promoter database of Saccharomyces cerevisiae

SGD: Saccharomyces genome database

S. pombe: Schizosaccharomyces pombe genome project

TRIPLES: Functional analysis of the yeast genome at Yale

Yeast Intron database: Spliceosomal introns of the yeast

Zebrafish (Danio rerio)

ZFIN: Zebrafish information network

ZGR: Zebrafish genome resources

ZIS: Zebrafish information server

Zebrafish: Zebrafish web server

DOMAIN DATABASE

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain often also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search and Batch CD-Search

CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
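To make the idea of scanning a query with a position-specific scoring matrix concrete, here is a minimal sketch. It is not RPS-BLAST: it uses an ungapped sliding window, a tiny hand-made three-column matrix, and an arbitrary default penalty, all of which are assumptions chosen purely for illustration.

```python
def scan_pssm(sequence, pssm, default=-4):
    """Slide an ungapped PSSM along a sequence and return (best_score, start).

    pssm: list of {residue: score} dictionaries, one per model column.
    default: score for residues missing from a column (illustrative value).
    Returns None when the sequence is shorter than the matrix.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(pssm, window))
        hits.append((score, start))
    return max(hits) if hits else None


# Toy matrix rewarding G, then K, then S or T (hypothetical scores).
toy_pssm = [{"G": 5}, {"K": 5}, {"S": 4, "T": 4}]
best = scan_pssm("AAGKSAA", toy_pssm)
```

Each column scores residues independently of the others, which is what distinguishes a position-specific matrix from a single substitution matrix applied uniformly along the sequence.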

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.

CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions.

Domain: A structural unit.

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Motif: A short unit found outside globular domains.

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.


You can find data in Pfam in various ways:

Analyze your protein sequence for Pfam matches.

View Pfam family annotation and alignments.

See groups of related families.

Look at the domain organisation of a protein sequence.

Find the domains on a PDB structure.

Query Pfam by keywords.

Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can find lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
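The two gathering cutoffs can be read as a two-stage filter: the whole-sequence score must reach the sequence cutoff, and each individual domain must reach the domain cutoff. The sketch below illustrates that reading; the function name, the argument layout and the numbers are hypothetical and are not taken from Pfam's own code.

```python
def domains_passing_gathering(sequence_score, domain_scores,
                              sequence_cutoff, domain_cutoff):
    """Return the domain scores that pass both gathering cutoffs.

    A sequence whose overall score is below the sequence cutoff
    contributes no domains, whatever its individual domain scores are.
    """
    if sequence_score < sequence_cutoff:
        return []
    return [score for score in domain_scores if score >= domain_cutoff]


# Hypothetical bit scores for illustration only.
accepted = domains_passing_gathering(30.0, [18.0, 12.0],
                                     sequence_cutoff=25.0, domain_cutoff=15.0)
rejected = domains_passing_gathering(20.0, [18.0],
                                     sequence_cutoff=25.0, domain_cutoff=15.0)
```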

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit score of the highest-scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability


HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10 (the highest certainty) down to 1 (complete uncertainty). Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest-scoring match in the full alignment.
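The trusted and noise cutoffs in this glossary are both derived from the gathering threshold: one is the lowest score that was included in the full alignment, the other the highest score that was left out. A minimal sketch with made-up bit scores (the function name and the values are hypothetical):

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Derive (TC, NC) from a list of match scores and a gathering threshold.

    TC: lowest score at or above the threshold (a match in the full alignment).
    NC: highest score below the threshold (a match not in the full alignment).
    Either value is None when no score falls on that side of the threshold.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    trusted = min(included) if included else None
    noise = max(excluded) if excluded else None
    return trusted, noise


tc, nc = trusted_and_noise_cutoffs([45.2, 31.0, 25.5, 24.1, 10.3],
                                   gathering_threshold=25.0)
```

By construction the gathering threshold always lies between the noise cutoff and the trusted cutoff.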

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.


The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember, though, that we are then exploring a limited set of genomes.

Different color schemes are used to easily identify the mode we are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
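The bits score defined here is simply a base-2 log-likelihood ratio. A minimal sketch of that arithmetic, with hypothetical probabilities chosen as powers of two so the result is easy to check by hand:

```python
import math


def bits_score(p_homologue_model, p_random_model):
    """Base-2 logarithm of the likelihood ratio between the two models."""
    return math.log2(p_homologue_model / p_random_model)


# The sequence is 2**10 times more likely under the homologue model
# than under the random model, so the score is 10 bits.
example_bits = bits_score(2.0 ** -10, 2.0 ** -20)
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.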

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
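The difference between domain composition and domain organisation can be shown with two small predicates over lists of domain names. This is a sketch of one reading of the definitions (composition as set containment, organisation as an ordered subsequence with extra domains allowed), not SMART's implementation; the domain names in the example are only illustrative.

```python
def same_composition(query, target):
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Additional domains may appear anywhere in the target.
    """
    remaining = iter(target)
    # 'domain in remaining' advances the iterator until it finds the domain,
    # so later query domains can only match further along the target.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "Pkinase"]
with_extra_domain = ["PH", "SH3", "SH2", "Pkinase"]
reordered = ["SH2", "SH3", "Pkinase"]
```

A protein with the query's domains in a different order still satisfies the composition test but fails the organisation test, which is why the two searches can return different result sets.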

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family. A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
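The consensus rule described here (most probable residue per position, capitalised when its probability exceeds 0.5) can be sketched in a few lines. The input layout, a list of per-position emission dictionaries, is an assumption made for illustration and is not HMMer's file format.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    match_emissions: list of {residue: probability}, one per match state
                     (hypothetical layout).
    Residues with probability above the threshold are shown in upper case,
    all others in lower case.
    """
    consensus = []
    for probabilities in match_emissions:
        residue, probability = max(probabilities.items(),
                                   key=lambda item: item[1])
        consensus.append(residue.upper() if probability > threshold
                         else residue.lower())
    return "".join(consensus)


toy_emissions = [
    {"G": 0.9, "A": 0.1},
    {"K": 0.4, "R": 0.35, "Q": 0.25},
    {"S": 0.6, "T": 0.4},
]
consensus_line = hmm_consensus(toy_emissions)
```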

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package, including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
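Non-redundancy in this sense only removes exact duplicates, so fragments of the same gene survive. A minimal sketch, assuming records are simple `(identifier, sequence)` pairs (a hypothetical layout) and that the first identifier seen for each sequence is the one kept:

```python
def non_redundant(records):
    """Keep the first record for each distinct sequence.

    records: iterable of (identifier, sequence) pairs (hypothetical layout).
    Comparison is case-insensitive; shorter fragments are NOT removed,
    because they are not identical to the full-length sequence.
    """
    seen = set()
    kept = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            kept.append((identifier, sequence))
    return kept


toy_records = [("a", "MKTAYIAK"), ("b", "MKTAYIAK"), ("c", "MKTAY")]
unique_records = non_redundant(toy_records)
```

The fragment "c" is retained even though it is a prefix of "a", which is the kind of residual redundancy the Normal mode discussion warns about.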

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value


This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
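E-values and P-values describe the same chance expectation in two ways. Under the usual BLAST assumption that the number of chance hits follows a Poisson distribution (a standard result from BLAST statistics, not something stated in these notes), the probability of at least one chance hit is P = 1 - exp(-E). A minimal sketch of the conversion in both directions:

```python
import math


def evalue_to_pvalue(evalue):
    """P(at least one chance hit) for a Poisson-distributed hit count."""
    return -math.expm1(-evalue)          # equals 1 - exp(-evalue)


def pvalue_to_evalue(pvalue):
    """Inverse of evalue_to_pvalue; requires 0 <= pvalue < 1."""
    return -math.log1p(-pvalue)          # equals -ln(1 - pvalue)


p_for_one_expected_hit = evalue_to_pvalue(1.0)
p_for_rare_hit = evalue_to_pvalue(1e-6)
```

For small values the two numbers are almost identical, which is why they are often used interchangeably when judging strong hits; they diverge once the E-value approaches or exceeds 1.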

PDB protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
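PROSITE motif patterns are written in a compact syntax that maps closely onto regular expressions. The sketch below converts a subset of that syntax (residue classes, `x` wildcards, exclusions in braces, repeat counts in parentheses, and the `<` and `>` anchors) and searches a made-up sequence. The pattern used is the widely quoted P-loop (ATP/GTP-binding site motif A) pattern; the converter is a simplified illustration, not PROSITE's own scanning tool.

```python
import re


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                      # x  = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {P} = anything but P
        element = element.replace("(", "{").replace(")", "}")    # x(4) = four residues
        element = element.replace("<", "^").replace(">", "$")    # sequence termini
        parts.append(element)
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
match = re.search(p_loop, "MAGESGAGKSTLL")   # toy sequence for illustration
```

Because such patterns are short, a match on its own is weak evidence of homology, in line with the note elsewhere in this glossary that sets of sequence motifs need not represent homologues.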

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
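The selection rule for secondary literature is a simple vote count: a neighbouring paper qualifies when more than two of the hand-selected papers list it. A minimal sketch, assuming the neighbour lists have already been retrieved and are supplied as a dictionary (the identifiers below are placeholders, and the retrieval step itself is not shown):

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbouring papers listed by at least `min_sources` originals.

    neighbours_by_paper: {original_paper_id: [neighbour_paper_id, ...]}
    min_sources=3 encodes "more than two original papers".
    """
    votes = Counter()
    for neighbours in neighbours_by_paper.values():
        votes.update(set(neighbours))     # count each original at most once
    return sorted(paper for paper, count in votes.items()
                  if count >= min_sources)


toy_neighbours = {
    "p1": ["n1", "n2"],
    "p2": ["n1", "n2"],
    "p3": ["n1"],
}
secondary = secondary_literature(toy_neighbours)
```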

Seed Alignment

An alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree and linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
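To give a feel for what low compositional complexity means, the sketch below flags sequence windows whose Shannon entropy is low. This is a simplified stand-in and not the SEG algorithm itself; the window width and entropy cutoff are illustrative assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def low_complexity_positions(sequence, width=12, max_entropy=2.2):
    """Return sorted positions covered by at least one low-entropy window.

    width and max_entropy are illustrative values chosen for this sketch.
    """
    flagged = set()
    for start in range(len(sequence) - width + 1):
        if window_entropy(sequence[start:start + width]) <= max_entropy:
            flagged.update(range(start, start + width))
    return sorted(flagged)
```

A run of a single repeated residue has zero entropy and is always flagged, whereas a window of twelve different residues has the maximum entropy for that width and is not.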

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either UniProt or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for UniProt and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses UniProt as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
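Two of the steps in the redundancy procedure, collapsing 100% identical sequences while keeping all their IDs and choosing the longest member of each cluster as representative, can be sketched as follows. The clustering itself (CD-HIT in SMART) is not reproduced here; the sketch assumes the clusters are already available, and the function names and input layouts are hypothetical.

```python
from collections import defaultdict

def collapse_identical(proteins):
    """Keep one copy of each identical sequence, remembering every ID.

    proteins: iterable of (protein_id, sequence) pairs.
    Returns {sequence: [ids sharing that sequence]}.
    """
    merged = defaultdict(list)
    for protein_id, sequence in proteins:
        merged[sequence].append(protein_id)
    return dict(merged)

def pick_representatives(clusters):
    """Return the ID of the longest member of each cluster.

    clusters: list of clusters, each a list of (protein_id, sequence)
    pairs, assumed to come from a separate clustering run. A protein
    that belongs to no cluster can be passed as a cluster of one.
    """
    return [max(members, key=lambda member: len(member[1]))[0] for members in clusters]
```

Only the representatives returned by the second step would then be counted in architecture queries, which is why the database can grow without inflating domain counts.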

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
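The last-common-ancestor step can be sketched as a comparison of taxonomic lineages. This is an illustration only: it assumes each organism's lineage is available as a root-to-leaf list of taxon names, which is one possible way of representing the NCBI taxonomy, and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf lists of taxon names, one per
    organism that contains a protein with the architecture.
    """
    ancestor = None
    # Walk down from the root while every lineage agrees on the taxon.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor
```

For an architecture seen in both human and yeast, the lineages diverge after Eukaryota, so Eukaryota would be reported as the point of origin.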

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details visit the change mode page

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
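A reciprocal best match between two genomes can be sketched as follows. The sketch assumes the best hit of every protein in the other genome has already been computed by some similarity search; the function name and the dictionary layout are hypothetical and do not describe the actual SMART pipeline.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs that are each other's best hit.

    best_a_to_b maps each protein of genome A to its best hit in
    genome B; best_b_to_a holds the reverse direction.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )
```

A protein whose best hit points back to a different protein is left out, which is what keeps the resulting pairs strictly one-to-one.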

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure-based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
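The behaviour of such a query (case-insensitive names, with a repeated name meaning that many copies are required) can be sketched as a multiset comparison. This is an illustration and not SMART's query engine: it handles only terms joined by AND, and the function name and the list-of-names representation of a protein are assumptions.

```python
from collections import Counter

def matches_query(query, protein_domains):
    """Check whether a protein satisfies an AND-only domain query.

    query: string such as "Sh3 AND sH3 AND sh3"; a name listed n times
    requires at least n copies of that domain.
    protein_domains: list of domain and feature names found in the protein.
    Names are compared case-insensitively.
    """
    required = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in protein_domains)
    return all(present[name] >= copies for name, copies in required.items())
```

A protein with three SH3 domains satisfies the example query, while one with only two does not.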

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order and composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world-wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
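The numeric cutoffs in this paragraph can be written down as a small sketch. It is a simplification under stated assumptions: the real S, O, L and I levels come from multi-linkage clustering of whole superfamilies, whereas the second function below only reports which identity cutoffs a single pair of domains would satisfy. The function names and the representation of side-chain completeness as a fraction are hypothetical.

```python
# Sequence identity cutoffs (percent) for the S, O, L and I levels.
SOLI_THRESHOLDS = (("S", 35), ("O", 60), ("L", 95), ("I", 100))

def meets_cath_criteria(resolution_angstrom, length, fraction_side_chains_resolved):
    """Apply the inclusion cutoffs quoted in the text.

    4 angstrom resolution or better, 40 residues or longer, and 70% or
    more side chains resolved (given here as a fraction between 0 and 1).
    """
    return (
        resolution_angstrom <= 4.0
        and length >= 40
        and fraction_side_chains_resolved >= 0.70
    )

def shared_levels(identity_percent):
    """List the SOLI levels whose identity cutoff a domain pair reaches.

    A pairwise simplification; it does not perform the clustering.
    """
    return [level for level, cutoff in SOLI_THRESHOLDS if identity_percent >= cutoff]
```

Under these cutoffs, two domains with 72% sequence identity would reach the S and O cutoffs but not L or I.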

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group and that for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
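The ±15 residue tolerance can be expressed as a simple comparison of boundary positions. How the benchmark actually paired and scored boundaries is not described here, so the pairing by sorted position, the function name and the input layout are assumptions made for this sketch.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Check predicted domain boundaries against reference boundaries.

    predicted, reference: lists of boundary residue positions for one
    chain. Boundaries are paired in sorted order (an assumption) and
    every pair must differ by at most `tolerance` residues.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p - r) <= tolerance
        for p, r in zip(sorted(predicted), sorted(reference))
    )
```

A prediction that places a boundary at residue 112 when manual validation placed it at 120 would count as agreeing, whereas one at residue 90 would not.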

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov Models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
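The decision between automatic chopping and manual review can be sketched as a check of the four criteria that the text lists explicitly. The text ends its list with "etc.", so the real protocol applies further criteria that are not reproduced here; the class name, field names and function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # angstroms
    longest_end_extension: int    # residues

def can_autochop(result):
    """Apply only the four criteria named in the text.

    SSAP score >= 80, sequence identity > 80%, RMSD <= 6.0 angstroms and
    longest end extension <= 10 residues. Additional criteria used by
    the real protocol are not known from the text and are omitted.
    """
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )
```

A result that fails any one check would be passed on as supporting information for manual boundary assignment rather than being used directly.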

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages', in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation and also in the web-based resources used to aid manual validation have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains decreases rapidly. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability are presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.


For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
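The acceptance rule for HMM hits described above can be sketched in a few lines of Python. This is only an illustration, not the CATH pipeline itself: the `HmmHit` record, its field names and the example hits are hypothetical, and the cut-offs are read from the text as an E-value below 0.01 and an overlap of at least 60% of the domain.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """One HMM match between a sequence and a CATH domain model (hypothetical record)."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the domain covered by the match
    domain_length: int     # total residues in the CATH domain


def is_homologue(hit: HmmHit, max_e_value: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Return True if the hit passes both the E-value and the overlap cut-off."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


# Made-up hits, purely to exercise the rule.
hits = [
    HmmHit("seqA", 1e-5, 90, 100),  # passes both criteria
    HmmHit("seqB", 1e-5, 40, 100),  # overlap too short
    HmmHit("seqC", 0.5, 95, 100),   # E-value too high
]
accepted = [h.sequence_id for h in hits if is_homologue(h)]
print(accepted)
```

Keeping the two thresholds as parameters makes it easy to reuse the same check for the stricter annotation step (95% identity, 80% overlap), which would need an identity field added to the record.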

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and to identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily is measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available; these include, for example, CATH domain PDB files, sequences and DSSP files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
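Because the four levels form a strict hierarchy, a CATH identifier can be treated as a four-part key. The sketch below assumes the usual dotted notation (for example 3.40.50.200 for the subtilisin family cited elsewhere in these notes); the type and function names are illustrative, not part of any CATH software, and the second identifier is used only as a comparison example.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    cath_class: int              # C: secondary structure content
    architecture: int            # A: 3D arrangement of secondary structures
    topology: int                # T: fold, including connectivity
    homologous_superfamily: int  # H: evolutionary family


def parse_cath_id(text: str) -> CathId:
    """Split a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def same_level(a: CathId, b: CathId, depth: int) -> bool:
    """True if two identifiers agree on the first `depth` levels (1 = C ... 4 = H)."""
    return a[:depth] == b[:depth]


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.720")      # illustrative second identifier
print(same_level(subtilisin, other, 3))   # same class, architecture and fold
print(same_level(subtilisin, other, 4))   # different homologous family
```

Comparing prefixes of the tuple mirrors how the hierarchy is read: two domains can share a fold (first three numbers) without belonging to the same homologous family.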

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
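The breakdown of the 443 'new' structures can be checked with a little arithmetic. The short Python sketch below reads the bracketed figures in the text as percentages and assumes that the novel folds are simply the remainder once the three named categories are subtracted; that count (35) is derived here, not quoted from the source.

```python
total_new = 443  # structures with 'new' sequences (Fig. 4b)

counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# Assumption: novel folds are whatever is left after the three named groups.
counts["novel fold"] = total_new - sum(counts.values())

# Percentages rounded to whole numbers, as quoted in the text.
percentages = {name: round(100 * n / total_new) for name, n in counts.items()}

# Share assigned as homologues (clear plus probable) using structural data.
homologue_share = round(
    100 * (counts["clear homologue"] + counts["probable homologue"]) / total_new
)

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
print(f"homologues in total: {homologue_share}%")
```

The derived figures agree with the 45%, 38%, 9% and 8% given above, and the combined homologue share matches the 83% quoted later in the discussion of function assignment.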


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Angstrom of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
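The distance criterion above lends itself to a short worked example. The Python sketch below is not JAIL's code: the data layout (a dictionary from residue number to atom coordinates), the function names and the toy coordinates are all assumptions made for illustration, and the "at least 5 C-alpha atoms" rule is interpreted as "at least 5 interface residues", since each residue contributes one C-alpha.

```python
import math
from typing import Dict, List, Set, Tuple

Atom = Tuple[float, float, float]   # x, y, z in Angstrom
Chain = Dict[int, List[Atom]]       # residue number -> atoms of that residue


def interface_residues(chain: Chain, partner: Chain,
                       cutoff: float = 4.5) -> Set[int]:
    """Residues of `chain` with at least one atom within `cutoff` of any partner atom."""
    partner_atoms = [atom for atoms in partner.values() for atom in atoms]
    selected = set()
    for residue, atoms in chain.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.add(residue)
    return selected


def is_valid_interface_side(residues: Set[int], min_residues: int = 5) -> bool:
    """One side of a protein interface needs at least `min_residues` residues."""
    return len(residues) >= min_residues


# Toy example: two parallel strands, one atom per residue, 4.0 Angstrom apart.
chain_a: Chain = {i: [(3.8 * i, 0.0, 0.0)] for i in range(1, 7)}
chain_b: Chain = {i: [(3.8 * i, 4.0, 0.0)] for i in range(1, 6)}

side_a = interface_residues(chain_a, chain_b)
print(sorted(side_a), is_valid_interface_side(side_a))
```

In the toy data, residue 6 of the first chain has no partner atom within 4.5 Angstrom and is therefore left out. The all-against-all loop is quadratic in the number of atoms; for real structures a spatial index or neighbour search would be the natural replacement.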

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database search

The JAIL search form accepts the following identifiers:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit or None. Full-text and SCOP text keyword searches are also provided, and results can be limited to interfaces with a given location (intra, inter or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can

Submit sequences for SCOP classification

View domain organisation sequence alignments and protein sequence details

For each genome you can

Examine superfamily assignments phylogenetic trees domain organisation lists and

networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation models and the database dump are freely available for download to everyone

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains especially

into SCOP superfamilies The superfamilies are groups of proteins which have structural

evidence to support a common evolutionary ancestor but may not have detectable

sequence homology


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.



The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.

Conserved Domain Database (CDD)

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

CD-Search & Batch CD-Search

CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
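To make the idea of scanning a query with a position-specific scoring matrix concrete, here is a minimal sketch. It is not RPS-BLAST: it slides a toy matrix along the query without gaps and reports the best-scoring window. The function name best_pssm_hit, the toy matrix and the default score of -4 for unlisted residues are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

def best_pssm_hit(query: str, pssm: List[Dict[str, int]],
                  default: int = -4) -> Tuple[Optional[int], Optional[int]]:
    """Slide the PSSM along the query (no gaps) and return (best_score, start)."""
    width = len(pssm)
    best_score, best_start = None, None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Each PSSM column scores the residue found at the matching position.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best_score is None or score > best_score:
            best_score, best_start = score, start
    return best_score, best_start

# Hypothetical three-column matrix; real PSSMs cover all 20 amino acids.
toy_pssm = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 4, "T": 4},
]
score, start = best_pssm_hit("MAGKSL", toy_pssm)
```

The point of the sketch is only that the score depends on the position in the model as well as on the residue, which is what distinguishes a PSSM from a single substitution matrix.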


CDART: Domain Architectures

Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
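The grouping of proteins by architecture can be sketched in a few lines. This is an illustration only: the score used here (number of domain types shared with the query) is an assumed stand-in, not CDART's actual scoring scheme, and the protein and domain names are toy data.

```python
from collections import defaultdict

def group_by_architecture(proteins):
    """Map each architecture (ordered tuple of domains) to the proteins having it."""
    groups = defaultdict(list)
    for name, domains in proteins.items():
        groups[tuple(domains)].append(name)
    return dict(groups)

def shared_domain_score(query_arch, other_arch):
    """Assumed score: how many distinct domain types the two architectures share."""
    return len(set(query_arch) & set(other_arch))

def rank_architectures(query_arch, proteins):
    """Return (architecture, proteins, score) tuples, best score first."""
    groups = group_by_architecture(proteins)
    rows = [(arch, sorted(names), shared_domain_score(query_arch, arch))
            for arch, names in groups.items()]
    return sorted(rows, key=lambda row: (-row[2], row[0]))

toy_proteins = {
    "P1": ["SH3", "SH2", "Pkinase"],
    "P2": ["SH3", "SH2", "Pkinase"],
    "P3": ["SH2", "Pkinase"],
    "P4": ["PDZ"],
}
ranking = rank_architectures(["SH3", "SH2", "Pkinase"], toy_proteins)
```

Because whole architectures are compared rather than residues, two proteins with little direct sequence similarity can still land in the same group.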

CDTree

CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

PFAM

Pfam 27.0 (March 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions.

Domain: A structural unit.

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Motif: A short unit found outside globular domains.

Related Pfam entries are grouped together into higher-level groupings known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.


You can find data in Pfam in various ways:

Analyze your protein sequence for Pfam matches.

View Pfam family annotation and alignments.

See groups of related families.

Look at the domain organisation of a protein sequence.

Find the domains on a PDB structure.

Query Pfam by keywords.

Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam: Lists of families, clans or proteomes can be browsed by their initial letter (or number). Pfam also lists the families which are new to this release and the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website.

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function.

Envelope coordinates

See Alignment coordinates.

Family

A collection of related protein regions.

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
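The two gathering cutoffs can be pictured as a two-stage filter. The sketch below assumes that a match is kept only when the sequence score reaches the sequence cutoff and the individual domain score reaches the domain cutoff; the class and function names, and the numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GatheringThreshold:
    sequence_cutoff: float  # minimum bit score for the whole sequence
    domain_cutoff: float    # minimum bit score for each individual domain hit

def domains_passing_ga(sequence_score: float, domain_scores: List[float],
                       ga: GatheringThreshold) -> List[float]:
    """Return the domain scores that survive both cutoffs (empty if none do)."""
    if sequence_score < ga.sequence_cutoff:
        return []  # the sequence as a whole fails, so no domain is reported
    return [s for s in domain_scores if s >= ga.domain_cutoff]

ga = GatheringThreshold(sequence_cutoff=25.0, domain_cutoff=20.0)
kept = domains_passing_ga(40.0, [22.0, 18.0], ga)
rejected = domains_passing_ga(24.0, [24.0], ga)
```

In the second call the single domain scores above the domain cutoff, yet nothing is kept because the sequence score is below the sequence cutoff.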

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
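As a rough illustration of how an alignment becomes a position-specific scoring system, the following sketch computes per-column log-odds scores. It is a deliberate simplification: it covers only match-state emissions (no insert or delete states, no transition probabilities, no sequence weighting), and it assumes a uniform background and a pseudocount of 1, none of which is claimed to be what HMMER does.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_log_odds(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    scores = []
    for pos in range(len(alignment[0])):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return scores

toy_alignment = ["GKS", "GRS", "GKT", "AKS"]
profile = column_log_odds(toy_alignment)
```

Residues that are frequent in a column receive positive scores, and residues never seen there receive negative ones, so the same amino acid can score differently at different positions.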


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets.

Motif

A short unit found outside globular domains.

Noise cutoff (NC)

The bit score of the highest scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale, with 10 being the highest certainty down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest scoring match in the full alignment.
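The trusted and noise cutoffs follow directly from the gathering threshold once all match scores are known. A minimal sketch, assuming a single list of bit scores and a single threshold (the function name and the scores are made up for illustration):

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Return (TC, NC): lowest score in the full alignment, highest score outside it."""
    in_full = [s for s in bit_scores if s >= gathering_threshold]
    outside = [s for s in bit_scores if s < gathering_threshold]
    tc = min(in_full) if in_full else None
    nc = max(outside) if outside else None
    return tc, nc

tc, nc = trusted_and_noise_cutoffs([55.2, 31.0, 27.4, 19.8, 12.1], 25.0)
```

By construction NC < GA <= TC, so the gap between the two cutoffs shows how cleanly the threshold separates members from non-members.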

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used to easily identify the mode you are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
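The definition above amounts to a one-line formula, score = log2(P(sequence | model) / P(sequence | random)). A small sketch with made-up likelihoods (the function name is an assumption):

```python
import math

def bits_score(p_model: float, p_random: float) -> float:
    """Log (base 2) of the likelihood ratio between the homology model and the random model."""
    return math.log2(p_model / p_random)

# A sequence that is 8 times more likely under the model scores 3 bits.
example = bits_score(0.008, 0.001)
```

Each additional bit therefore means the model explains the sequence twice as well as the random model does, and a negative score means the random model fits better.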

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
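The difference between the two definitions is easiest to see in code. In this sketch a protein is a list of domain names from N- to C-terminus; composition is treated as set inclusion and organisation as an ordered subsequence test. The function names and domain lists are illustrative assumptions, not SMART's implementation.

```python
def same_composition(query, target):
    """True if the target has at least one copy of every domain in the query."""
    return set(query) <= set(target)

def same_organisation(query, target):
    """True if the query's domains occur in the target in the same order.

    Extra domains in the target are allowed, so this is a subsequence test.
    """
    remaining = iter(target)
    # 'domain in remaining' consumes the iterator up to the first match,
    # which forces later query domains to be found further along the target.
    return all(domain in remaining for domain in query)

query = ["SH3", "SH2"]
shuffled = ["SH2", "PH", "SH3"]
extended = ["SH3", "PH", "SH2", "Pkinase"]
```

Here `shuffled` matches the query by composition but not by organisation, while `extended` matches by both, since the inserted PH and trailing Pkinase domains are permitted.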

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
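For orientation, the sketch below uses the standard BLAST-style relation between a bit score S and the expected number of chance hits, E = m * n * 2^(-S), where m is the query length and n the database length, together with the usual conversion P = 1 - exp(-E). This is offered as a general illustration of how an E-value scales with database size; it is not SMART's HMM-based calculation, and the function names are assumptions.

```python
import math

def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """E-value for a bit score under the BLAST-style relation E = m * n * 2**(-S)."""
    return query_length * database_length * 2.0 ** (-bit_score)

def p_value_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

e = expected_hits(10, 32, 32)   # 1024 * 2**-10
p = p_value_from_e(e)
```

Doubling the database length doubles E for the same score, which is why the same alignment can be significant in a small database and unremarkable in a large one. For small E the two quantities are nearly equal.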

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
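The consensus rule can be sketched directly from the description: pick the most probable residue in each position and capitalise it only when its probability exceeds 0.5. The emission table below is invented toy data and the function name is an assumption.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities."""
    letters = []
    for column in emissions:
        # Most probable residue at this position.
        residue, prob = max(column.items(), key=lambda item: item[1])
        letters.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(letters)

toy_emissions = [
    {"G": 0.9, "A": 0.1},
    {"K": 0.4, "R": 0.35, "Q": 0.25},
    {"S": 0.6, "T": 0.4},
]
consensus = hmm_consensus(toy_emissions)
```

The middle position is shown in lower case because its best residue, K, has a probability of only 0.4.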

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication.

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm.

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
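A minimal sketch of what "no identical pairs" means in practice: exact duplicates are dropped, but a fragment of the same protein survives because it is a different string. The record format (name, sequence) and the function name are assumptions for illustration.

```python
def non_redundant(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    kept = []
    for name, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            kept.append((name, sequence))
    return kept

toy_records = [
    ("full_length", "MKTAYIAK"),
    ("duplicate", "MKTAYIAK"),
    ("fragment", "MKTAY"),      # same gene, shorter sequence: not identical
    ("lowercase", "mktayiak"),
]
nr = non_redundant(toy_records)
```

This is why such a database still inflates domain counts: the fragment is retained alongside the full-length entry.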

ORF

Open reading frame.

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB: protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
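The selection rule above is essentially a vote count. The sketch below assumes the neighbour lists have already been retrieved (the Medline lookup itself is not shown) and that "more than two" means at least three distinct original papers; the identifiers are placeholders, not real citations.

```python
from collections import Counter

def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbouring papers linked from at least min_sources original papers."""
    votes = Counter()
    for original, neighbours in neighbours_by_paper.items():
        # Count each neighbour once per original paper, even if listed twice.
        for neighbour in set(neighbours):
            votes[neighbour] += 1
    return sorted(paper for paper, count in votes.items() if count >= min_sources)

toy_neighbours = {
    "original_A": ["x", "y"],
    "original_B": ["x", "z"],
    "original_C": ["x", "y"],
}
selected = secondary_literature(toy_neighbours)
```

Paper x is linked from three originals and is selected; paper y is linked from only two and is not.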

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
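One way to picture this thinning step is a greedy filter over pairwise distances. Note the simplification: the definition above refers to branch lengths in a CLUSTALW-derived tree, whereas this sketch assumes a precomputed table of pairwise distances as a stand-in. The function name, the input order and the distances are all hypothetical.

```python
def thin_homologues(names, distance, min_distance=0.2):
    """Keep a sequence only if it is at least min_distance from every kept one."""
    kept = []
    for name in names:
        if all(distance[frozenset((name, other))] >= min_distance for other in kept):
            kept.append(name)
    return kept

toy_distance = {
    frozenset(("a", "b")): 0.05,  # a and b are near-identical
    frozenset(("a", "c")): 0.60,
    frozenset(("b", "c")): 0.55,
}
seed_members = thin_homologues(["a", "b", "c"], toy_distance)
```

Of the close pair a and b, only the first encountered is retained, so the result depends on input order; a real procedure would need a rule for choosing the representative.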

SEG

A program of Wootton amp Federhen [1] that detects regions of the query sequence that have lowcompositional complexity [2]

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate aquery You can use either Uniprot or Ensembl sequence identifiers

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria

1 cytoplasmic domains that possess kinase phosphatase ubiquitin ligase or phospholipaseenzymatic activities or those that stimulate GTPase-activation or guanine nucleotideexchange

2 cytoplasmic domains that occur in at least two proteins with different domain

organisations of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards thenucleus resulting in the initiation of a cellular response More recently prokaryotic two-componentsignalling domains have been added to the SMART set

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acidsequences (SignalP home page)

SMART Simple Modular Arc hi tecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains inuser-supplied protein sequences

Species

Numbers of domains present in a variety of selected taxa(animal archaea bacteria fungi plants and protozoa) are shown in annotation pages

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of proteinsequences SwissProt annotations have been mined for SMART-derived annotations of alignments

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for UniProt and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life (iTOL). The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses UniProt as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
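The redundancy-reduction procedure for the Normal mode database can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `identity` function is a toy, ungapped stand-in for the alignment-based identity that CD-HIT computes, the greedy longest-first clustering merely imitates its behaviour, identical sequences are merged within a species only, and all function names are hypothetical.

```python
from collections import defaultdict


def identity(a, b):
    """Toy identity: matching positions divided by the longer length.

    Assumption: stands in for a real alignment-based identity; sequences
    are compared position by position without gaps.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def representatives(proteins, cutoff=0.96):
    """proteins: iterable of (protein_id, species, sequence) tuples.

    Returns {representative_id: [every id it stands for]}.
    """
    # Keep one copy of 100% identical proteins, remembering every ID.
    ids_by_seq = defaultdict(list)
    for pid, species, seq in proteins:
        ids_by_seq[(species, seq)].append(pid)

    # Separate the proteins of each species.
    by_species = defaultdict(list)
    for (species, seq), ids in ids_by_seq.items():
        by_species[species].append((seq, ids))

    result = {}
    for entries in by_species.values():
        # Greedy clustering, longest sequence first, so the first member
        # of each cluster is also its longest one (the representative).
        entries.sort(key=lambda e: len(e[0]), reverse=True)
        clusters = []  # list of (representative_sequence, member_ids)
        for seq, ids in entries:
            for rep_seq, members in clusters:
                if identity(seq, rep_seq) >= cutoff:
                    members.extend(ids)
                    break
            else:
                clusters.append((seq, list(ids)))
        for _, members in clusters:
            result[members[0]] = members
    return result
```

Only the keys of the returned dictionary would then be used for architecture queries and domain counts, while the values keep the alternative IDs available.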

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
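The last-common-ancestor step can be illustrated with a small sketch. It assumes the taxonomy is available as a child-to-parent map with a single root; the `PARENT` table below is a tiny hypothetical stand-in for the NCBI taxonomy, and the function names are illustrative rather than part of SMART.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root, taxon first."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def last_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa."""
    taxa = list(taxa)
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        others = set(lineage(taxon, parent))
        # Keep the deepest-first order while dropping unshared nodes.
        common = [node for node in common if node in others]
    return common[0]


# Hypothetical child -> parent map standing in for the NCBI taxonomy.
PARENT = {
    "Homo sapiens": "Metazoa",
    "Drosophila melanogaster": "Metazoa",
    "Saccharomyces cerevisiae": "Fungi",
    "Metazoa": "Opisthokonta",
    "Fungi": "Opisthokonta",
    "Opisthokonta": "Eukaryota",
}
```

Given the organisms that carry at least one protein with a particular architecture, `last_common_ancestor(organisms, PARENT)` returns the node that would be reported as its point of origin.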

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (PubMed link).
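The idea behind the catalytic activity check can be sketched as follows. This is a simplified illustration, not SMART's implementation: it assumes the requirements are given as positions within the predicted domain sequence (the real check presumably works through the domain alignment), and the function name and data layout are hypothetical.

```python
def check_catalytic_activity(domain_seq, required):
    """required: {1-based position in the domain: set of allowed residues}.

    Returns (is_active, missing), where missing lists each position whose
    residue is absent or not among the allowed ones.
    """
    missing = []
    for pos, allowed in sorted(required.items()):
        in_range = 0 < pos <= len(domain_seq)
        residue = domain_seq[pos - 1] if in_range else None
        if residue not in allowed:
            missing.append((pos, residue))
    return (not missing, missing)
```

A domain for which `is_active` is false would be the one marked as 'Inactive', with `missing` supplying the details shown on the annotation page.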

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.

The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern, standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display the SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
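Adjusting feature positions according to gaps amounts to converting a coordinate in the ungapped protein into a column of the alignment. A minimal sketch, assuming 1-based coordinates and '-' as the gap character (the function names are hypothetical):

```python
def to_alignment_column(aligned_seq, residue_pos, gap="-"):
    """Map a 1-based position in the ungapped sequence to a 1-based column."""
    seen = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            seen += 1
            if seen == residue_pos:
                return column
    raise ValueError("position lies beyond the end of the sequence")


def map_feature(aligned_seq, start, end):
    """Shift a (start, end) feature onto alignment coordinates."""
    return (to_alignment_column(aligned_seq, start),
            to_alignment_column(aligned_seq, end))
```

For the aligned row `MK--LV-A`, a domain spanning residues 2 to 5 of the protein would be drawn across columns 2 to 8 of the alignment.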

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42): 405-7).


Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser. Click here for more info.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.


Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
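A query of this kind can be evaluated against the stored domains and intrinsic features of a protein roughly as follows. The sketch assumes that repeating a name means 'at least that many copies' and that only AND is supported; these semantics and the function names are assumptions made for illustration, not a description of SMART's query engine.

```python
import re
from collections import Counter


def parse_query(query):
    """Turn 'Sh3 AND sH3 AND TyrKc' into required copy numbers per name."""
    terms = re.split(r"\s+AND\s+", query.strip(), flags=re.IGNORECASE)
    # Upper-casing every term makes the comparison case insensitive.
    return Counter(term.upper() for term in terms if term)


def matches(query, features):
    """features: names of the domains and intrinsic features of one protein."""
    have = Counter(name.upper() for name in features)
    need = parse_query(query)
    return all(have[name] >= count for name, count in need.items())
```

With this reading, `Sh3 AND sH3 AND sh3` selects proteins carrying at least three SH3 domains, and `SIGNAL AND TyrKc` selects tyrosine kinases that also have a signal peptide.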

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.


GENERAL INTRODUCTION


The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to come from structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
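The nine-level identifier described above can be handled as a simple tuple. The sketch below assumes the code is written as nine dot-separated integers in C.A.T.H.S.O.L.I.D order; the example codes used with it are made up for illustration, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class CathSolid(NamedTuple):
    class_: int                  # C: secondary structure content
    architecture: int            # A: gross orientation of secondary structures
    topology: int                # T: fold group
    homologous_superfamily: int  # H
    s35: int                     # S: 35% sequence identity cluster
    o60: int                     # O: 60% sequence identity cluster
    l95: int                     # L: 95% sequence identity cluster
    i100: int                    # I: 100% sequence identity cluster
    d_counter: int               # D: counter within the I-level


def parse_cathsolid(code):
    """Split a dot-separated CATHSOLID code into its nine levels."""
    parts = code.split(".")
    if len(parts) != 9:
        raise ValueError(f"expected 9 levels, got {len(parts)}")
    return CathSolid(*(int(part) for part in parts))


def shared_levels(a, b):
    """Number of leading levels two domains have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count
```

Two domains that share the first five levels belong to the same superfamily and the same 35% sequence identity cluster, but fall into different clusters at 60% identity.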


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
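The AutoChop decision can be sketched as a threshold check over the inheritance results. Only the four criteria quoted in the text are modelled (the source lists further, unspecified ones), and choosing the 'best' result by SSAP score is an assumption made for this illustration; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # Å
    longest_end_extension: int  # residues


def passes_autochop(result):
    """Apply the four thresholds quoted in the text."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)


def decide(results):
    """Return ('AutoChop', best) or ('manual', best_support_or_None)."""
    if not results:
        return ("manual", None)
    passing = [r for r in results if passes_autochop(r)]
    # Assumption: 'best' means the highest SSAP score in the relevant pool.
    best = max(passing or results, key=lambda r: r.ssap_score)
    return ("AutoChop" if passing else "manual", best)
```

When no result passes, the best one is still returned so that it can be shown as support information during manual boundary assignment.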


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
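The acceptance rule for sequence relatives can be sketched in a few lines of Python. This is a minimal illustration, not the CATH pipeline: the Hit record, its field names and the example values are hypothetical, the thresholds are taken as E-value < 0.01 and 60% overlap, and "overlap" is assumed to mean the fraction of the CATH domain covered by the aligned region.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One HMM hit of a sequence against a CATH domain (hypothetical record)."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # domain residues covered by the alignment
    domain_length: int     # total residues in the CATH domain


def is_homologue(hit: Hit, max_e_value: float = 0.01, min_overlap: float = 0.60) -> bool:
    """Accept a hit only if it passes both the E-value and the overlap threshold."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


# Made-up hits, for illustration only.
hits = [
    Hit("seq_A", 1e-5, 120, 150),  # strong E-value, 80% overlap: accepted
    Hit("seq_B", 1e-5, 60, 150),   # strong E-value, 40% overlap: rejected
    Hit("seq_C", 0.5, 140, 150),   # weak E-value: rejected
]
accepted = [hit.sequence_id for hit in hits if is_homologue(hit)]
print(accepted)
```

Both conditions must hold at once, so a highly significant hit that covers only a fragment of the domain is still rejected.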

A recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and to identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that modified the active site of the domain or created new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
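The SSAP score bands quoted above can be turned into a small lookup. The sketch below is only a reading aid: the function name and the category labels are made up for illustration, and the cut-offs are the indicative values from the text ("generally" above 80 or 70), not hard rules used by CATH.

```python
def classify_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the indicative bands described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (possibly convergent)"
    return "no clear structural relationship"


for example_score in (100, 86.5, 74, 52):
    print(example_score, classify_ssap(example_score))
```

A score in the 70 to 80 band says only that the folds are similar; as the text notes, without sequence or functional similarity it cannot by itself establish common ancestry.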

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
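Because the four levels are hierarchical, a CATH identifier is conventionally written as four dot-separated numbers (Class.Architecture.Topology.Homologous family). The following sketch, with hypothetical helper names, splits such an identifier into its levels and checks whether two domains share a fold group; the second identifier is invented purely for the comparison.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """The four hierarchy levels of a CATH identifier."""
    class_: int
    architecture: int
    topology: int
    homologous_family: int


def parse_cath_id(text: str) -> CathId:
    """Parse a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a four-level CATH identifier: {text!r}")
    return CathId(*(int(part) for part in parts))


def same_fold_group(a: CathId, b: CathId) -> bool:
    """Two domains share a fold group when their C, A and T levels agree."""
    return a[:3] == b[:3]


first = parse_cath_id("3.40.50.200")
second = parse_cath_id("3.40.50.300")  # invented identifier for comparison
print(first, same_fold_group(first, second))
```

Comparing prefixes of the tuple mirrors the hierarchy: agreement on the first three levels means the same fold, while agreement on all four means the same homologous family.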

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7). Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
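The percentages in this breakdown can be re-derived from the quoted counts. The short check below uses only the numbers given in the text; the count of novel folds is not stated explicitly, so it is computed here as the remainder, which is an assumption about how the categories partition the 443 structures.

```python
total_new_domains = 2159   # domains classified June 1997 to March 1998
new_sequences = 443        # structures not clearly homologous to a known structure

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: whatever is left over corresponds to the novel folds.
counts["novel folds"] = new_sequences - sum(counts.values())

percentages = {name: round(100 * n / new_sequences) for name, n in counts.items()}
homologous_share = round(100 * (total_new_domains - new_sequences) / total_new_domains)
assigned_by_structure = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / new_sequences
)

print(counts["novel folds"], percentages, homologous_share, assigned_by_structure)
```

The remainder comes to 35 structures, about 8% of 443, in line with the quoted figure; clear plus probable homologues together make up about 83% of the new sequences.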

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
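The test "the first three EC identifiers were the same" is easy to state in code. The sketch below is illustrative only: the helper names are hypothetical, and the EC numbers are example values chosen to show the comparison, not data taken from CATH.

```python
def ec_fields(ec_number: str) -> tuple:
    """Split an EC number such as '3.4.21.62' into its four fields."""
    fields = tuple(ec_number.strip().split("."))
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return fields


def share_first_three(ec_numbers) -> bool:
    """True when every EC number in the family agrees on its first three fields."""
    prefixes = {ec_fields(ec)[:3] for ec in ec_numbers}
    return len(prefixes) == 1


def single_ec_code(ec_numbers) -> bool:
    """True when the whole family carries one and the same EC number."""
    return len({ec_fields(ec) for ec in ec_numbers}) == 1


closely_related = ["3.4.21.62", "3.4.21.64"]  # differ only in the last field
divergent = ["3.4.21.62", "1.1.1.1"]          # differ from the first field on
print(share_first_three(closely_related), share_first_three(divergent))
```

A family can pass the first check while failing the stricter single-code check, which is the distinction the text draws between "closely related" and "single" functions.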

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.

To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only some protein structures are included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compilation of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
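The distance criterion can be illustrated with a small, dependency-free sketch. It is not JAIL's implementation: the data layout (a dictionary from residue identifier to a list of atom coordinates), the function name and the toy coordinates are all assumptions made for the example, and the cutoff is taken as 4.5 Å.

```python
import math

CUTOFF_ANGSTROM = 4.5  # distance threshold, taken as 4.5 Å


def interface_residues(chain_a, chain_b, cutoff=CUTOFF_ANGSTROM):
    """Return residues of chain_a with at least one atom within `cutoff`
    of any atom of chain_b.

    Each chain is a dict mapping a residue identifier to a list of
    (x, y, z) atom coordinates in Ångström.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for residue_id, atoms in chain_a.items():
        # The whole residue counts as soon as a single atom is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.append(residue_id)
    return selected


# Toy coordinates, for illustration only.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(5.0, 0.0, 0.0)]}

print(interface_residues(chain_a, chain_b))
print(interface_residues(chain_b, chain_a))
```

The all-against-all loop is quadratic in the number of atoms, which is fine for a toy case; for whole PDB entries a spatial index such as a grid or k-d tree would be the usual choice. The additional rule of at least 5 C-alpha atoms per interface side is not checked here.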

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None

A full-text keyword search and a SCOP text keyword search are also available. Searches can be restricted to interfaces having a given location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated (by Hai Fang).

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies (by Christine Vogel).

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its "Links" menu, and select "Domain Relatives" to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

CDTree: CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor. It is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries.

CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the "Database" Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
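The accession convention described above can be illustrated with a small sketch. Only the "cl" prefix for superfamily clusters comes from these notes; the other prefixes in the table (cd, pfam, smart, COG, KOG, PRK, TIGR) are assumptions about how the listed source databases usually name their models, and the function name is hypothetical.

```python
# Assumed mapping from accession prefix to CDD source database.
# Only "cl" (superfamily cluster) is stated in the notes; the rest are
# illustrative assumptions and should be checked against CDD itself.
PREFIX_TO_SOURCE = {
    "cl": "CDD superfamily cluster",
    "cd": "NCBI-curated domain",
    "pfam": "Pfam",
    "smart": "SMART",
    "COG": "COG",
    "KOG": "KOG",
    "PRK": "PRK",
    "TIGR": "TIGRFAMs",
}


def classify_cdd_accession(accession: str) -> str:
    """Return the assumed source database for a CDD accession string."""
    # Try longer prefixes first so a short prefix never shadows a long one.
    for prefix in sorted(PREFIX_TO_SOURCE, key=len, reverse=True):
        if accession.startswith(prefix):
            return PREFIX_TO_SOURCE[prefix]
    return "unknown"
```

A helper like this could be used to filter a list of hits by source database, mirroring what the "Database" Search Field does on the web interface.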

PFAM

Pfam 27.0 (March 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions.

Domain: A structural unit.

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Motif: A short unit found outside globular domains.

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches.

View a Pfam family: View Pfam family annotation and alignments.

View a clan: See groups of related families.

View a sequence: Look at the domain organisation of a protein sequence.

View a structure: Find the domains on a PDB structure.

Keyword search: Query Pfam by keywords.

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
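As a minimal sketch of how the two GA cutoffs might be applied together, the snippet below treats a match as belonging to the full alignment only when both scores reach their cutoff. The class and function names are hypothetical, and the use of "greater than or equal" on both cutoffs is an assumption rather than a statement of how Pfam implements it.

```python
from dataclasses import dataclass


@dataclass
class GatheringThreshold:
    sequence_cutoff: float  # minimum whole-sequence bit score
    domain_cutoff: float    # minimum per-domain bit score


def passes_ga(sequence_score: float, domain_score: float,
              ga: GatheringThreshold) -> bool:
    """True if a match meets both GA cutoffs (assumed: >= on each)."""
    return (sequence_score >= ga.sequence_cutoff
            and domain_score >= ga.domain_cutoff)
```

For example, with illustrative cutoffs of 25.0 (sequence) and 20.0 (domain), a match scoring 30.0 and 22.0 would be kept, while one scoring 30.0 and 18.0 would not.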

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry, which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment, which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability


HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with "*" being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
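The scale described above can be sketched as a small decoder. This assumes the posterior probabilities arrive as a string with one character per aligned residue, with digits giving the value directly and "*" standing for 10; the function names and the cutoff of 5 for "uncertain" are hypothetical choices made for illustration.

```python
def posterior_scale(pp_line: str) -> list:
    """Convert a posterior-probability string to values on the 1-10 scale.

    Assumed encoding: digits give the value directly and '*' means 10.
    Any other character (for example a gap marker) maps to None.
    """
    values = []
    for ch in pp_line:
        if ch == "*":
            values.append(10)
        elif ch in "0123456789":
            values.append(int(ch))
        else:
            values.append(None)
    return values


def uncertain_positions(pp_line: str, threshold: int = 5) -> list:
    """Indices whose posterior value falls below a hypothetical threshold."""
    return [i for i, v in enumerate(posterior_scale(pp_line))
            if v is not None and v < threshold]
```

A heat map like the one Pfam displays could then be driven by these numeric values, with low values drawn towards red and high values towards green.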

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
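The relationship between the gathering threshold, the trusted cutoff and the noise cutoff can be sketched as follows. This is a simplification that assumes a single GA value and a flat list of bit scores, whereas Pfam keeps separate sequence and domain cutoffs; the function name is hypothetical.

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Return (TC, NC) for a list of bit scores and a single GA value.

    TC is the lowest score at or above GA (the lowest-scoring match in the
    full alignment); NC is the highest score below GA (the best match left
    out). Either value is None when no score falls on that side of GA.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc
```

With illustrative scores of 50.2, 31.0, 27.5, 19.8 and 12.1 and a GA of 25.0, TC would be 27.5 and NC would be 19.8, so GA always lies between NC and TC.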

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.


The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used so that the current mode (Normal or Genomic) is easy to identify.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
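The definition above amounts to a one-line calculation. The sketch below assumes the two likelihoods are already available as plain probabilities, which is rarely the case in practice because real tools work in log space; the function and parameter names are hypothetical.

```python
import math


def bit_score(p_homolog_model: float, p_random_model: float) -> float:
    """Log2 of the likelihood ratio between the homology and random models."""
    if p_homolog_model <= 0 or p_random_model <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(p_homolog_model / p_random_model)
```

A positive score means the sequence is better explained by the homology model than by the random model, and each additional bit doubles that ratio.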

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query, in the same order (additional domains are allowed).
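The difference between the two query types can be made concrete with a short sketch. It assumes each protein is represented simply as a list of domain names in N-to-C order; the function names are hypothetical and the domain names in the usage note are only illustrative.

```python
def same_domain_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_domain_organisation(query, protein):
    """True if the query domains occur in the protein in the same order.

    Additional domains may be interspersed, so this is a subsequence test.
    """
    remaining = iter(protein)
    # 'domain in remaining' advances the iterator past the first match,
    # so later query domains must be found further along the protein.
    return all(domain in remaining for domain in query)
```

For a query of SH3, SH2, TyrKc, a protein with the order SH2, TyrKc, SH3 matches by composition but not by organisation, whereas SH3, PH, SH2, TyrKc matches both.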

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
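One common way to turn this definition into arithmetic is to multiply the chance probability for a single sequence by the number of sequences searched. That formula is an assumption brought in for illustration; the notes do not say how SMART computes its E-values, and the names below are hypothetical.

```python
def expected_chance_hits(p_per_sequence: float, database_size: int) -> float:
    """E-value as the per-sequence chance probability times database size."""
    if not 0.0 <= p_per_sequence <= 1.0:
        raise ValueError("p_per_sequence must be a probability")
    if database_size < 0:
        raise ValueError("database_size must be non-negative")
    return p_per_sequence * database_size
```

This makes the dependence on database size explicit: the same alignment score yields a larger E-value when the database searched is larger.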

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
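The consensus rule above can be sketched in a few lines. The input format, a list of per-position dictionaries mapping amino-acid letters to probabilities, is an assumption made for illustration and is not how HMMer stores its models; the function name is hypothetical.

```python
def hmm_consensus(position_probs, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    position_probs: list of dicts mapping amino-acid letter -> probability.
    The most probable residue is written in upper case when its probability
    exceeds the threshold, and in lower case otherwise.
    """
    letters = []
    for probs in position_probs:
        residue, p = max(probs.items(), key=lambda item: item[1])
        letters.append(residue.upper() if p > threshold else residue.lower())
    return "".join(letters)
```

A position where alanine has probability 0.9 would print as "A", while a position whose best residue only reaches 0.4 would print in lower case.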

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
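P-values and E-values are often related through a Poisson model, in which the chance of seeing at least one random hit is 1 - exp(-E). The sketch below assumes that model; it is a widely used approximation, not something the notes attribute to SMART, and the function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """P(at least one chance hit) under a Poisson model with mean E."""
    if e_value < 0:
        raise ValueError("e_value must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)
```

For small values the two numbers are nearly identical, so the distinction matters mainly for weak hits, where the P-value saturates towards 1 while the E-value keeps growing.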

PDB: protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns Some SMART domain annotations containlinks to PROSITE

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 this set was extended with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic) domains, only extracellular domains, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
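The selection rule in this procedure can be sketched as a simple counting step. The input structure, a dictionary from each hand-selected paper to its list of neighbouring paper IDs, is an assumption made for illustration, as is the decision to leave the hand-selected papers themselves out of the result; retrieval of the neighbours from Medline is not shown.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Neighbouring papers linked to more than two hand-selected papers.

    neighbours_by_paper: dict mapping each hand-selected paper ID to the IDs
    of its neighbouring papers (up to 100 each in the procedure above).
    """
    counts = Counter()
    for original, neighbours in neighbours_by_paper.items():
        # Count each neighbour once per original paper; ignore self-links.
        counts.update(set(neighbours) - {original})
    originals = set(neighbours_by_paper)
    # Assumption: papers already in the hand-selected set are not repeated.
    return sorted(p for p, n in counts.items()
                  if n >= min_sources and p not in originals)
```

"More than two" is expressed here as a minimum of three source papers, which is why the default for min_sources is 3.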

Seed Alignment


Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine: Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART: Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization: Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup: Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information: SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database: The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice: You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server: The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode: SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)

o each species' proteins are separated

o CD-HIT clustering with 96% identity cutoff is performed on each species separately

o longest member of each protein cluster is used as the representative

o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
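Two of the steps in the procedure above, collapsing identical proteins and choosing the longest member of each cluster, can be sketched directly. The clustering itself is taken as given input here (the notes say CD-HIT is used for it), and the function names, input structures and tie-breaking rule are assumptions made for illustration.

```python
def collapse_identical(sequences):
    """Group protein IDs that share a 100% identical sequence.

    sequences: dict mapping protein ID -> amino-acid sequence.
    Returns a dict mapping each distinct sequence to all of its IDs,
    so the alternative IDs stay available.
    """
    groups = {}
    for pid, seq in sequences.items():
        groups.setdefault(seq, []).append(pid)
    return groups


def pick_representatives(clusters, sequences):
    """Return one representative ID per cluster: the longest member.

    clusters: list of lists of protein IDs (singletons are one-item lists).
    Ties are broken by ID so the result is deterministic (an assumption).
    """
    representatives = []
    for members in clusters:
        best = max(members, key=lambda pid: (len(sequences[pid]), pid))
        representatives.append(best)
    return representatives
```

Domain counts and architecture queries would then be run over the returned representatives only, which is what keeps fragments of the same protein from being counted several times.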

Domain architecture invention dating: As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
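The last-common-ancestor step can be sketched as a longest-common-prefix computation over lineages. This assumes each organism's lineage is available as a list of taxon names ordered from the root downwards, which is a simplification of the NCBI taxonomy; the function name and the lineages in the usage note are illustrative.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by all lineages (each listed root first)."""
    if not lineages:
        return None
    ancestor = None
    # zip stops at the shortest lineage, so no index can run past the end.
    for taxa in zip(*lineages):
        if all(t == taxa[0] for t in taxa):
            ancestor = taxa[0]
        else:
            break
    return ancestor
```

If an architecture is found in a chordate, an arthropod and a fungus, the shared prefix of their lineages ends at Eukaryota, which would be reported as the point of origin.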

Changes from version 40 to 41

Two modes of operation Normal and Genomic

For more details visit the change mode page

Intrinsic protein disorder prediction You can now include DisEMBL prediction of protein disorder in your search parameters Theresults are displayed as blue regions in protein schematics DisEMBLs HOTLOOPS and REM465methods are currently used Visit the DisEMBL page for more information on the method

Catalytic activity check SMART now includes data on specific requirements for catalytic activity for some of our domains(50 catalytic domains at the moment full list here) If the required amino acids are not present inthe predicted domain it will be marked as Inactive Domain annotation page will show you thedetails on which amino acids are missing and the links to relevant literature Pubmed link

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.
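When preparing input for a batch service with a per-access limit like the one above, a long list of identifiers has to be split into chunks. The sketch below only shows the chunking; the function name is an assumption, no SMART interface is called, and the per-day limit is left to the caller.

```python
def batches(items, size=500):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 1200 placeholder identifiers split according to the 500-per-access limit.
ids = [f"seq{i}" for i in range(1200)]
chunks = list(batches(ids))
chunk_sizes = [len(chunk) for chunk in chunks]
```

For 1200 identifiers this gives three chunks, which also stays within the 2000-per-day limit quoted above.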

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).


Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.


Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
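The behaviour described for Selective SMART queries (AND-combined names, case-insensitive matching, repeated names meaning multiple copies) can be mimicked on a toy scale. This is a hedged sketch, not SMART's implementation: the function name and the flat list of feature names per protein are assumptions, and only the AND operator is handled.

```python
from collections import Counter


def matches(query, features):
    """Check whether a protein's features satisfy an AND-only query.

    query:    e.g. "Sh3 AND sH3 AND sh3" (AND is assumed to be upper case).
    features: domain/feature names annotated on one protein, with repeats.
    """
    required = Counter(term.strip().upper() for term in query.split(" AND "))
    present = Counter(name.upper() for name in features)
    # Each name must occur at least as often as it appears in the query.
    return all(present[name] >= count for name, count in required.items())


three_sh3 = matches("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH2", "SH3"])
two_sh3 = matches("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3"])
signal_kinase = matches("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"])
```

Counting names rather than testing set membership is what lets a repeated term stand for "at least this many copies".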

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.


GENERAL INTRODUCTION


The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
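The inclusion criteria stated at the end of the paragraph above can be written as a simple predicate. This is an illustrative sketch only: the function and parameter names are assumptions, the side-chain criterion is read as a fraction of 0.70, and structures without a resolution value (such as NMR models) are not handled.

```python
def meets_cath_criteria(resolution_angstrom, n_residues, fraction_side_chains_resolved):
    """Apply the three thresholds quoted in the text to one structure."""
    return (
        resolution_angstrom <= 4.0                   # "4 Å resolution or better"
        and n_residues >= 40                         # "40 residues in length or longer"
        and fraction_side_chains_resolved >= 0.70    # "70% or more side chains resolved"
    )


accepted = meets_cath_criteria(2.1, 180, 0.98)
too_coarse = meets_cath_criteria(4.5, 180, 0.98)
too_short = meets_cath_criteria(1.8, 35, 1.0)
```

Lower resolution values are better, which is why the first comparison uses `<=`.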


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
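The ±15 residue agreement criterion used in this benchmark can be expressed as a short check. The sketch assumes that boundaries are given as residue positions and that predicted and reference lists are compared position by position after sorting; the source does not spell out how boundaries were paired, so this is one plausible reading.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if every predicted boundary lies within `tolerance` residues of its reference."""
    if len(predicted) != len(reference):
        return False  # a different number of boundaries cannot agree
    return all(
        abs(p - r) <= tolerance
        for p, r in zip(sorted(predicted), sorted(reference))
    )


# A three-domain chain has two internal boundaries.
close_enough = boundaries_agree([118, 251], [110, 260])
too_far = boundaries_agree([118, 290], [110, 260])
```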

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10), sequence-based methods such as hidden Markov Models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
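The automatic-versus-manual decision described above amounts to a conjunction of thresholds. The sketch below covers only the four criteria named in the text (the source ends the list with "etc.", so the real protocol applies more); the class and function names are assumptions, and the RMSD cutoff is read as 6.0 Å.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float     # percent
    rmsd: float                  # in Å
    longest_end_extension: int   # residues


def can_autochop(result):
    """Return True if all four thresholds quoted in the text are met."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


good = can_autochop(Inheritance(91.2, 97.0, 1.4, 3))
needs_manual = can_autochop(Inheritance(91.2, 75.0, 1.4, 3))
```

A result that fails any single threshold is not discarded; as the text says, it is passed on as supporting information for manual boundary assignment.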

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation and also in the web-based resources used to aid manual validation have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.


For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
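The hit-selection rule in the paragraph above (an E-value cutoff combined with a minimum overlap with the CATH domain) can be sketched as a filter. The function and parameter names are assumptions, the E-value cutoff is read from the extracted text as 0.01, and overlap is taken to mean aligned residues divided by the length of the CATH domain, which the source does not define precisely.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """Apply an E-value threshold and a minimum domain-overlap fraction to one hit."""
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


hits = [
    ("seqA", 1e-5, 90, 120),   # significant, 75% overlap
    ("seqB", 1e-5, 60, 120),   # significant, but only 50% overlap
    ("seqC", 0.5, 120, 120),   # full overlap, but not significant
]
accepted_ids = [name for name, e, aligned, length in hits
                if is_homologous_hit(e, aligned, length)]
```

The same function with `min_overlap=0.80` and an identity check added would correspond to the stricter annotation-transfer rule mentioned at the end of the paragraph.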

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
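The SSAP score bands quoted above can be turned into a small lookup. This is only a sketch of how the quoted thresholds might be read; the function name and the label strings are illustrative assumptions, and the notes themselves say the bands hold "generally", not strictly.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough tier described in the text.

    The tiers are rules of thumb taken from the notes, not hard cut-offs
    used by CATH itself.
    """
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural relationship"


for s in (100, 85, 75, 50):
    print(s, interpret_ssap(s))
```

A score in the 70 to 80 band on its own does not establish common ancestry; as the text notes, it may reflect convergence on a common fold.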

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology (or fold) and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
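Because the four levels are nested, a CATH identifier can be treated as a path through the hierarchy. The sketch below assumes the usual dotted form of an identifier (for example 3.40.50.200 for the subtilisin family mentioned later in these notes); the class and function names are hypothetical and are not part of any CATH software.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_family: int


def parse_cath_id(text: str) -> CathId:
    """Split a dotted C.A.T.H identifier into its four numeric levels."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {text!r}")
    return CathId(*(int(p) for p in parts))


def deepest_shared_level(a: CathId, b: CathId) -> str:
    """Name the deepest level of the hierarchy at which two domains agree."""
    names = ["none", "class", "architecture", "topology", "homologous family"]
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return names[depth]


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.720")  # illustrative second identifier
print(deepest_shared_level(subtilisin, other))
```

Two domains that agree down to the topology level share a fold but have not been placed in the same homologous family.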

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
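The two grouping criteria above amount to a simple either/or rule. The sketch below is a minimal reading of it; the notes do not define "high structural similarity" at this point, so the SSAP cut-off of 80 is an assumption borrowed from the score bands quoted elsewhere in these notes, and the function name is hypothetical.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           high_ssap: float = 80.0) -> bool:
    """Apply the two criteria quoted in the text.

    seq_identity_pct: pairwise sequence identity in percent.
    ssap_score: structural similarity score (0-100).
    high_ssap: assumed threshold for "high structural similarity".
    """
    if seq_identity_pct >= 35.0:
        return True  # significant sequence similarity on its own
    # otherwise require high structural similarity plus some sequence similarity
    return ssap_score >= high_ssap and seq_identity_pct >= 20.0


print(same_homologous_family(40.0, 50.0))
print(same_homologous_family(25.0, 85.0))
print(same_homologous_family(25.0, 60.0))
```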

The CATH database of protein structures contains approximately 18,000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α-β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propeller.

Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing.

Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
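The breakdown of the 443 structures can be checked with a few lines of arithmetic. The counts for clear homologues, probable homologues and analogues are taken from the text; treating the novel folds as whatever remains is an assumption, since the notes give that group only as a proportion.

```python
total_new = 443  # structures with 'new' sequences (Fig. 4b)

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: novel folds are the remainder once the three groups are removed.
counts["novel folds"] = total_new - sum(counts.values())

percentages = {name: round(100 * n / total_new) for name, n in counts.items()}

# Share of new-sequence structures recognisable as homologues of a known family.
homologue_pct = percentages["clear homologues"] + percentages["probable homologues"]

for name, n in counts.items():
    print(f"{name}: {n} ({percentages[name]}%)")
print(f"homologues in total: {homologue_pct}%")
```

The rounded shares agree with the quoted figures, and the two homologue groups together give the 83% cited later for structures with novel sequences that could be assigned as homologues.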


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
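The test "the first three EC identifiers were the same" is easy to express in code. The sketch below assumes EC numbers written in the standard four-field dotted form; the function names and the example EC numbers are illustrative and are not taken from the CATH analysis.

```python
from typing import Iterable, Tuple


def ec_prefix(ec_number: str, levels: int = 3) -> Tuple[str, ...]:
    """Return the first `levels` fields of a four-field EC number."""
    fields = ec_number.strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return tuple(fields[:levels])


def family_shares_ec(ec_numbers: Iterable[str], levels: int = 3) -> bool:
    """True if every EC number in the family agrees on its first `levels` fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1


family = ["3.4.21.62", "3.4.21.14"]      # illustrative EC numbers
print(family_shares_ec(family))           # compare the first three fields
print(family_shares_ec(family, levels=4)) # require a single full EC code
```

Comparing three fields corresponds to "closely related" function, while `levels=4` corresponds to the stricter "single EC code" criterion.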

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing through the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
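The distance criterion above can be sketched in a few lines. This is a toy illustration and not the JAIL implementation: the data layout (a dictionary from residue id to atom coordinates), the function names, and the reading of "at least 5 C-alpha atoms" as "at least 5 residues" (one C-alpha per amino acid) are all assumptions.

```python
import math
from typing import Dict, List, Tuple

Atom = Tuple[float, float, float]   # x, y, z in Ångström
Residue = List[Atom]                # all atoms of one residue


def interface_residues(chain_a: Dict[str, Residue],
                       chain_b: Dict[str, Residue],
                       cutoff: float = 4.5) -> List[str]:
    """Residues of chain_a with at least one atom within `cutoff` of any atom of chain_b."""
    partner_atoms = [atom for residue in chain_b.values() for atom in residue]
    hits = []
    for res_id, atoms in chain_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            hits.append(res_id)  # the complete residue counts, not just the close atom
    return hits


def is_valid_interface_side(residue_ids: List[str], min_residues: int = 5) -> bool:
    """One side of a protein interface needs at least `min_residues` residues."""
    return len(residue_ids) >= min_residues


# Tiny made-up example: A1 has an atom 3 Å from chain B, A2 is 6 Å away.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(4.0, 0.0, 0.0)]}
side_a = interface_residues(chain_a, chain_b)
print(side_a, is_valid_interface_side(side_a))
```

The all-against-all distance loop is quadratic in the number of atoms; for whole structures a spatial index (for example a k-d tree) would be the usual choice.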

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

A full-text keyword search and a SCOP text search are also available, and searches can be restricted to interfaces having a given location (intra, inter, or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
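Searching the structure database programmatically today would normally go through NCBI's E-utilities rather than the older Entrez address quoted in the notes. The sketch below only builds a search URL and makes no network call; the use of the ESearch endpoint with db=structure is an assumption about current NCBI services that is not described in these notes, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not mentioned in the notes).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_structure_search_url(term: str, retmode: str = "json") -> str:
    """Build an ESearch URL that queries the Entrez structure database for `term`."""
    query = urlencode({"db": "structure", "term": term, "retmode": retmode})
    return f"{ESEARCH}?{query}"


# 1GOT is the PDB entry used as an example elsewhere in these notes.
url = build_structure_search_url("1GOT")
print(url)
```

The returned URL can be opened in a browser or fetched with any HTTP client; NCBI's usage guidelines on request rates apply to automated use.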


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

It is also possible to limit CDD search results to domain models from any given source database by using the Database Search Field.

PFAM

Pfam 27.0 (March 2013, 14,831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.


Pfam entries are classified in one of four ways:

Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present

Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.


Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Sequence search: Analyze your protein sequence for Pfam matches

View a Pfam family: View Pfam family annotation and alignments

View a clan: See groups of related families

View a sequence: Look at the domain organisation of a protein sequence

View a structure: Find the domains on a PDB structure

Keyword search: Query Pfam by keywords

Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

Pfam can also be browsed through lists of families, clans or proteomes which begin with a chosen letter (or number). There is also a list of Pfam families which are new to this release, and a list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website.

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function.

Envelope coordinates

See Alignment coordinates.

Family

A collection of related protein regions.

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
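The two GA cutoffs can be illustrated with a minimal Python sketch. It assumes that a match must reach the sequence cutoff for the sequence as a whole and the domain cutoff for each individual domain; the class name, function name and all scores are hypothetical and are not part of Pfam or HMMER.

```python
from dataclasses import dataclass


@dataclass
class GatheringThreshold:
    """Hypothetical container for the two GA cutoffs of one Pfam HMM (bit scores)."""
    sequence_cutoff: float
    domain_cutoff: float


def domains_passing_ga(sequence_score, domain_scores, ga):
    """Return the domain scores that pass, assuming both cutoffs must be met."""
    # The sequence as a whole must reach the sequence cutoff first.
    if not sequence_score >= ga.sequence_cutoff:
        return []
    # Each domain is then checked against the domain cutoff.
    return [s for s in domain_scores if s >= ga.domain_cutoff]


# Made-up cutoffs and scores, for illustration only.
ga = GatheringThreshold(sequence_cutoff=25.0, domain_cutoff=20.0)
print(domains_passing_ga(48.5, [30.1, 18.4], ga))
```

In this sketch a sequence scoring below the sequence cutoff contributes no domains at all, even if one of its domains scores highly.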

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

A HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets.

Motif

A short unit found outside globular domains.

Noise cutoff (NC)

The bit score of the highest scoring match not in the full alignment.

Pfam-A

A HMM based, hand curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale, with 10 being the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to a HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest scoring match in the full alignment.

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember that we are exploring a limited set of genomes, though.

Different color schemes are used to easily identify the mode we are in (Normal mode or Genomic mode).

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
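The bits score definition above can be written out as a short Python sketch. The function names and the example likelihoods are hypothetical; real tools such as HMMer and BLAST compute these quantities internally and typically work in log space, which the second function imitates.

```python
import math


def bit_score(p_homolog, p_random):
    """log2 of the likelihood ratio between the homology model and the random model."""
    if not (p_homolog > 0 and p_random > 0):
        raise ValueError("likelihoods must be positive")
    return math.log2(p_homolog / p_random)


def bit_score_from_logs(ln_p_homolog, ln_p_random):
    """Same quantity computed from natural-log likelihoods, avoiding underflow."""
    return (ln_p_homolog - ln_p_random) / math.log(2)


# A sequence that is 8 times more likely under the homology model scores 3 bits.
print(bit_score(8.0, 1.0))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.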

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn²⁺-binding or Ca²⁺-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
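The difference between domain composition and domain organisation can be sketched in Python. This is an illustration only: it assumes that "additional domains are allowed" means extra domains may appear anywhere, including between the query domains, and the function names and example architectures are hypothetical.

```python
def same_composition(query, target):
    """True if the target has at least one copy of every query domain."""
    return set(query).issubset(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Extra domains in the target are allowed (subsequence match).
    """
    remaining = iter(target)
    # 'd in remaining' advances the iterator, so order is enforced.
    return all(d in remaining for d in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["TyrKc", "SH2", "SH3"]))         # True
print(same_organisation(query, ["TyrKc", "SH2", "SH3"]))        # False
print(same_organisation(query, ["SH3", "SH2", "PH", "TyrKc"]))  # True
```

Every protein with the same organisation also has the same composition, but not the other way round.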

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
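The consensus rule described above can be sketched as follows. The per-position probability tables are made up, and the function is a hypothetical illustration of the rule rather than HMMer's own code: the most probable residue is reported at each position, in upper case only when its probability exceeds 0.5.

```python
def hmm_consensus(position_probs, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities."""
    chars = []
    for probs in position_probs:
        # Pick the highest-probability residue at this position.
        residue, p = max(probs.items(), key=lambda kv: kv[1])
        # Upper case marks a highly conserved position.
        chars.append(residue.upper() if p > threshold else residue.lower())
    return "".join(chars)


# Three hypothetical match states of a protein model.
model = [
    {"G": 0.90, "A": 0.10},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.55, "R": 0.45},
]
print(hmm_consensus(model))  # GlK
```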

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication.

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm.

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB: protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
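The selection rule in the procedure above can be sketched in Python. The sketch assumes the neighbour lists have already been retrieved (the Medline lookup itself is not shown), and the function name and paper identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbouring papers linked to more than two original papers.

    neighbours_by_paper maps each hand-selected paper to the list of its
    neighbouring papers. "More than two" is expressed as min_sources=3.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each original paper at most once per neighbour.
        for paper in set(neighbours):
            counts[paper] += 1
    return sorted(p for p, n in counts.items() if n >= min_sources)


# Made-up identifiers: "x" neighbours three originals, "y" only two.
example = {
    "A": ["x", "y"],
    "B": ["x", "y"],
    "C": ["x", "z"],
}
print(secondary_literature(example))  # ['x']
```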

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise, and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence, SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
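The bookkeeping in the redundancy procedure above can be sketched in Python for a single species. The clustering step itself is done by CD-HIT and is not reproduced here: cluster membership is taken as given input. All function names, identifiers and sequences are hypothetical.

```python
def collapse_identical(records):
    """Keep one entry per distinct sequence, remembering every ID that shares it."""
    ids_by_sequence = {}
    for protein_id, sequence in records:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def representatives(proteins, clusters):
    """Pick the longest member of each cluster, plus all unclustered proteins.

    proteins: dict mapping protein ID to sequence (one species).
    clusters: list of lists of protein IDs, e.g. parsed from clustering output.
    """
    clustered = set()
    chosen = []
    for members in clusters:
        clustered.update(members)
        chosen.append(max(members, key=lambda pid: len(proteins[pid])))
    singles = [pid for pid in proteins if pid not in clustered]
    return chosen + singles


proteins = {"p1": "MKT", "p2": "MKTAY", "p3": "MKTA", "p4": "GGG"}
print(representatives(proteins, [["p1", "p2", "p3"]]))  # ['p2', 'p4']
```

Only the IDs returned by `representatives` would then be used for architecture queries and domain counts.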

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
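The last-common-ancestor step can be sketched in Python. The sketch assumes each organism is given as a root-to-leaf lineage of taxon names; the lineages below are simplified for illustration and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        raise ValueError("need at least one lineage")
    ancestor = None
    # Walk down from the root, one taxonomic level at a time.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor


# Simplified lineages of three organisms sharing one domain architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
print(last_common_ancestor(lineages))  # Eukaryota
```

With these inputs the architecture would be dated to Eukaryota, because the fungal lineage diverges from the two animal lineages at the next level.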

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and links to the relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
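The idea of a 1:1 reciprocal best match between two genomes can be sketched in Python. The sketch assumes the best hit of every protein in the other genome has already been computed (for example from a similarity search); the function name and the protein identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


# Made-up best-hit tables between genome A and genome B.
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
b_to_a = {"b1": "a1", "b2": "a3"}
print(reciprocal_best_hits(a_to_b, b_to_a))  # [('a1', 'b1'), ('a3', 'b2')]
```

Here a2 is not paired, because its best hit b2 points back to a3 instead.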

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains, and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.
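The case-insensitive, multi-copy matching described for Selective SMART in this version can be sketched in Python. This is not SMART's actual query engine: it only handles terms joined by AND, and the function name and example architectures are hypothetical.

```python
from collections import Counter


def matches_query(query, protein_domains):
    """True if the protein has every queried domain, in at least the queried copy number.

    Only the AND operator is handled; names are compared case-insensitively.
    """
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in protein_domains)
    return all(present[name] >= copies for name, copies in wanted.items())


# Repeating a term asks for that many copies of the domain.
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH3", "SH2"]))  # True
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH2"]))         # False
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))     # True
```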

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer: The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages: As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The numbers of new structures being deposited in the Protein Data Bank (PDB) continue to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
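The nine-level code and the inclusion criteria described above can be illustrated with a short Python sketch. It assumes that a CATHsolid code is written as nine dot-separated integers, in the same style as the four-part CATH ids quoted elsewhere in these notes; the function names and the example code are hypothetical and are not taken from any CATH software.

```python
# Illustrative sketch only: assumes a CATHsolid code is nine dot-separated
# integers, one per level (C, A, T, H, S, O, L, I, D).
LEVELS = "CATHSOLID"


def parse_cathsolid(code: str) -> dict[str, int]:
    """Split a nine-part code into a mapping from level letter to number."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def superfamily_code(code: str) -> str:
    """Return the first four levels (C.A.T.H), i.e. the homologous superfamily."""
    return ".".join(code.split(".")[:4])


def meets_inclusion_criteria(resolution_angstrom: float,
                             n_residues: int,
                             side_chain_fraction: float) -> bool:
    """Apply the three thresholds quoted in the text: resolution of 4 A or
    better, at least 40 residues, at least 70% of side chains resolved."""
    return (resolution_angstrom <= 4.0
            and n_residues >= 40
            and side_chain_fraction >= 0.70)


example = "3.40.50.200.1.1.1.1.2"  # hypothetical SOLID values
print(parse_cathsolid(example))
print(superfamily_code(example))
print(meets_inclusion_criteria(2.1, 120, 0.95))
```

Because the D-level is only a counter within the I-level, two domains that share the first eight numbers are identical in sequence and differ only in that final counter.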

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
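The decision between automatic chopping and manual curation can be sketched as a simple threshold check. The Python below is a minimal illustration under stated assumptions, not the ChopClose implementation: it encodes only the four criteria listed in the text (the "etc." indicates there are others), and it assumes that the "best" result is the one with the highest SSAP score. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float             # 0-100
    sequence_identity: float      # percent
    rmsd: float                   # angstroms
    longest_end_extension: int    # residues


def passes_autochop(r: Inheritance) -> bool:
    """Check the four thresholds quoted in the text."""
    return (r.ssap_score >= 80
            and r.sequence_identity > 80
            and r.rmsd <= 6.0
            and r.longest_end_extension <= 10)


def choose_action(candidates: list[Inheritance]) -> tuple[str, Optional[Inheritance]]:
    """Return ("AutoChop", result) if any candidate passes, otherwise
    ("manual", best_result) so the result can support manual curation."""
    if not candidates:
        return ("manual", None)
    passing = [c for c in candidates if passes_autochop(c)]
    pool = passing or candidates
    # Assumption: "best" means highest SSAP score.
    best = max(pool, key=lambda c: c.ssap_score)
    return ("AutoChop" if passing else "manual", best)


good = Inheritance(ssap_score=92.0, sequence_identity=95.0, rmsd=1.2, longest_end_extension=3)
weak = Inheritance(ssap_score=85.0, sequence_identity=80.0, rmsd=2.0, longest_end_extension=4)
print(choose_action([weak, good])[0])
print(choose_action([weak])[0])
```

Note that the sequence identity criterion is a strict inequality in the text, so a candidate at exactly 80% identity falls through to manual assignment.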

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
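A tally of this kind is straightforward to reproduce once each chain has been assigned a domain count. The following Python sketch shows one way to compute the percentage of chains with one, two, three or more domains; the chain identifiers and counts are made-up toy data, not CATH figures.

```python
from collections import Counter


def domain_count_distribution(domains_per_chain: dict[str, int]) -> dict[int, float]:
    """Map each domain count to the percentage of chains having that count."""
    counts = Counter(domains_per_chain.values())
    total = sum(counts.values())
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


# Toy input: hypothetical chain identifiers mapped to number of domains.
toy = {"chainA": 1, "chainB": 1, "chainC": 2, "chainD": 1}
print(domain_count_distribution(toy))
```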

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
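The hit-filtering step described above amounts to two threshold tests per hit. The Python sketch below is a hedged illustration: the E-value cutoff is passed as a parameter and defaults to the value as read from the text, and "residue overlap" is assumed to mean the fraction of the CATH domain's residues covered by the alignment, a definition the text does not spell out. The `Hit` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    sequence_id: str       # identifier of the matched sequence
    evalue: float          # E-value reported by the HMM scan
    aligned_residues: int  # domain residues covered by the alignment
    domain_length: int     # total residues in the CATH domain


def is_homologue(hit: Hit, max_evalue: float = 0.01, min_overlap: float = 0.60) -> bool:
    """Keep a hit only if it is significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


hits = [
    Hit("seq1", 1e-5, 90, 120),   # significant, 75% overlap
    Hit("seq2", 0.5, 120, 120),   # full overlap but not significant
    Hit("seq3", 1e-5, 50, 120),   # significant but only ~42% overlap
]
print([h.sequence_id for h in hits if is_homologue(h)])
```

The stricter annotation step (95% identity, 80% overlap) could reuse the same pattern with an identity field added and different thresholds.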

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core and, although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
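As a rough reading aid, the SSAP score bands quoted above can be written as a small lookup. This Python sketch is only a rule-of-thumb mapping: the text says scores "generally" fall in these ranges, and it stresses that a score above 70 without sequence or functional similarity may reflect convergence rather than homology. The function name and the returned labels are hypothetical.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough band described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold"      # Topology level; may be convergent
    return "no clear similarity"


for s in (100, 85, 75, 50):
    print(s, interpret_ssap(s))
```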

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
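The grouping rule just stated is an either/or test, which a few lines of Python can make explicit. This is a sketch under an assumption: the text does not quantify "high structural similarity", so the code borrows the SSAP score of 80 mentioned earlier in these notes as a stand-in threshold. The function and parameter names are hypothetical.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           high_structural_similarity: float = 80.0) -> bool:
    """Either >=35% sequence identity alone, or high structural similarity
    combined with >=20% sequence identity.

    The SSAP cutoff for "high structural similarity" is an assumption."""
    if seq_identity_pct >= 35:
        return True
    return ssap_score > high_structural_similarity and seq_identity_pct >= 20


print(same_homologous_family(40, 60))   # sequence identity alone suffices
print(same_homologous_family(25, 85))   # structure plus some sequence similarity
print(same_homologous_family(10, 95))   # structure alone does not suffice
```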

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
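The percentages quoted for the 443 'new' structures can be checked with a few lines of arithmetic. The sketch below is only an illustration: the three counts come from the text, while the number of novel folds is not stated there and is assumed here to be whatever remains after the three quoted groups are subtracted.

```python
# Counts quoted in the text for the 443 'new' sequences (Fig. 4b).
TOTAL_NEW = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: novel folds are the remainder after the three quoted groups.
counts["novel folds"] = TOTAL_NEW - sum(counts.values())


def percent(part, whole):
    """Percentage of whole, rounded to the nearest integer."""
    return round(100 * part / whole)


shares = {name: percent(n, TOTAL_NEW) for name, n in counts.items()}
for name, n in counts.items():
    print(f"{name}: {n} ({shares[name]}%)")

# Homologues recognisable from structure = clear + probable homologues.
structural_homologues = counts["clear homologues"] + counts["probable homologues"]
print("assigned as homologues:", percent(structural_homologues, TOTAL_NEW), "%")
```

Under that assumption the remainder reproduces the quoted share of novel folds, and the two homologue groups together give the share of novel sequences that structure comparison assigns to a known family.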

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.

To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
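The distance criterion above can be sketched in a few lines. This is a minimal illustration and not JAIL's actual code: the function names, the tuple-based atom representation and the toy coordinates are all assumptions made for the example.

```python
import math

CUTOFF = 4.5      # Angstrom, from the definition above
MIN_RESIDUES = 5  # minimum number of C-alpha atoms (one per residue) per part


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residue ids of chain_a with at least one atom within cutoff of chain_b.

    Each chain is a list of (residue_id, (x, y, z)) tuples, one per atom.
    """
    hits = []
    for res_id, xyz in chain_a:
        if res_id in hits:
            continue  # the whole residue already counts as interface
        if any(math.dist(xyz, other) <= cutoff for _, other in chain_b):
            hits.append(res_id)
    return hits


def is_valid_part(residue_ids, minimum=MIN_RESIDUES):
    """A protein part of an interface needs at least `minimum` residues."""
    return len(residue_ids) >= minimum


# Toy coordinates (hypothetical, not taken from any PDB entry).
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (1.5, 0.0, 0.0)),
           (2, (10.0, 0.0, 0.0)), (3, (3.0, 0.0, 0.0))]
chain_b = [(101, (4.0, 0.0, 0.0)), (101, (5.0, 1.0, 0.0))]

print(interface_residues(chain_a, chain_b))
```

The all-against-all loop is quadratic in the number of atoms; for whole structures a spatial index (for example a grid or k-d tree) would normally replace it, but the criterion itself stays the same.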

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a large number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

None

A full-text keyword search and a SCOP text keyword search are also available. Results can be restricted to interfaces having a given location: intra, inter or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification.

View domain organisation, sequence alignments and protein sequence details.

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks.

Check for over- and under-represented superfamilies within a genome.

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments.

Explore the taxonomic distribution of a superfamily across the tree of life.

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

Pfam entries are classified in one of four ways:

Family: A collection of related protein regions.

Domain: A structural unit.

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Motif: A short unit found outside globular domains.

Pfam 27.0 (March 2013, 14831 families)

Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

You can find data in Pfam in various ways:

Analyze your protein sequence for Pfam matches.

View Pfam family annotation and alignments.

See groups of related families.

Look at the domain organisation of a protein sequence.

Find the domains on a PDB structure.

Query Pfam by keywords.

You can also enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can browse lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
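The two gathering cutoffs can be pictured as a two-stage filter. The sketch below is an illustration only, not Pfam's pipeline: the `Hit` class, the identifiers and every score and cutoff value are hypothetical, and it assumes that a sequence must reach the sequence cutoff before its individual domains are compared with the domain cutoff.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    sequence_id: str
    sequence_score: float   # bit score for the whole sequence
    domain_scores: list     # bit scores, one per matched domain


def domains_passing_ga(hit, sequence_cutoff, domain_cutoff):
    """Return the domain scores admitted to the full alignment.

    Assumption: the sequence must reach the sequence cutoff, and each
    domain is then kept only if it reaches the domain cutoff.
    """
    if hit.sequence_score < sequence_cutoff:
        return []
    return [s for s in hit.domain_scores if s >= domain_cutoff]


# Hypothetical cutoffs and hits, for illustration only.
SEQ_GA, DOM_GA = 25.0, 20.0
hits = [
    Hit("seqA", 61.0, [40.0, 21.0]),
    Hit("seqB", 30.0, [18.0, 12.0]),
    Hit("seqC", 19.5, [19.5]),
]
for hit in hits:
    print(hit.sequence_id, domains_passing_ga(hit, SEQ_GA, DOM_GA))
```

In this toy data seqB passes at the sequence level but none of its domains do, which is the situation the second cutoff exists to handle.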

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20 000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
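The trusted cutoff and the noise cutoff defined in this glossary are simply the two scores that bracket the gathering threshold. A minimal sketch, assuming a flat list of bit scores and a single threshold (the function name and all numbers are hypothetical):

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Split bit scores at the gathering threshold and return (TC, NC).

    TC is the lowest score at or above the threshold, i.e. the lowest
    scoring match in the full alignment; NC is the highest score below
    it. Either value is None when its side of the split is empty.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Hypothetical bit scores for one family.
scores = [87.3, 45.1, 25.4, 24.8, 12.0]
tc, nc = trusted_and_noise_cutoffs(scores, gathering_threshold=25.0)
print("TC =", tc, "NC =", nc)
```

Whenever both values exist they satisfy NC < GA <= TC, so the gap between them shows how cleanly the threshold separates members from non-members.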

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used to identify which mode you are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
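The definition of the bits score is a one-line formula, sketched below. The function name and the example likelihoods are made up for illustration; real tools compute the two likelihoods from their own models.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """log2 of the likelihood ratio between the homology and random models."""
    return math.log2(likelihood_homologue / likelihood_random)


# Hypothetical likelihoods: the sequence is 8 times more probable under
# the homology model than under the random model.
print(bits_score(0.5, 0.0625))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio, and a score of 0 means that the two models explain the sequence equally well.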

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled-coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
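The difference between domain composition and domain organisation can be made concrete with two small predicates. This is a sketch of one reading of the two definitions, not SMART's query code: architectures are modelled as plain lists of domain names, "same order" is interpreted as an ordered subsequence, and the example architectures are invented.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order.

    Additional domains are allowed anywhere, so this is an ordered
    subsequence test. Membership tests on an iterator consume it, which
    forces each query domain to be found after the previous one.
    """
    remaining = iter(protein)
    return all(domain in remaining for domain in query)


# Invented architectures, written N-terminus to C-terminus.
query = ["SH3", "SH2", "TyrKc"]
protein_a = ["SH3", "PH", "SH2", "TyrKc"]   # extra domain, same order
protein_b = ["SH2", "SH3", "TyrKc"]         # same domains, different order

for name, arch in [("protein_a", protein_a), ("protein_b", protein_b)]:
    print(name, same_composition(query, arch), same_organisation(query, arch))
```

Every architecture that matches on organisation also matches on composition, but not the reverse, which is why the composition search is the looser of the two.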

Entrez

A WWW-based system that allows easy retrieval of sequence structure molecular biology andliterature data (Entrez) SMARTs domain annotation pages contain links to the Entrez systemthereby providing extensive literature structure and sequence information

E-value

This represents the number of sequences with a score greater than or equal to X that are expected purely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with the number of alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using hidden Markov models, leading to more accurate estimates than before.
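An E-value is an expected count, not a probability. A minimal sketch of how the two relate is given below; it assumes that chance hits follow a Poisson distribution, which is the usual assumption for BLAST-like statistics and is not stated in these notes. The function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits.

    P = 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    return -math.expm1(-e_value)


for e in (1e-5, 0.1, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small E-values the two numbers are almost identical, while for large E-values the probability saturates at 1 and the E-value keeps growing.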

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are required by alignment algorithms in order to limit excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM (hidden Markov model)

HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).

Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
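The rule described above can be sketched as follows. This is only an illustration: it assumes the match-state emission probabilities are already available as one dictionary per position, and it does not read HMMer's own file format. The function name and the example probabilities are hypothetical.

```python
def hmm_consensus(emissions, threshold=0.5):
    """One-line consensus from per-position emission probabilities.

    emissions: list of dicts mapping amino-acid letter -> probability.
    The most probable residue is shown at each position, in upper case
    only if its probability exceeds the threshold.
    """
    consensus = []
    for column in emissions:
        residue, probability = max(column.items(), key=lambda item: item[1])
        if probability > threshold:
            consensus.append(residue.upper())
        else:
            consensus.append(residue.lower())
    return "".join(consensus)


columns = [
    {"G": 0.9, "A": 0.1},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.6, "R": 0.4},
]
print(hmm_consensus(columns))  # prints "GlK"
```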

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bit scores.

Homology

Evolutionary descent from a common ancestor, through gene duplication or speciation.

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB (non-redundant database)

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
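As a rough illustration of what "no identical pairs" means, the sketch below collapses exact duplicates while remembering every identifier. The identifiers and sequences are made up, and this is not the procedure actually used to build NRDB.

```python
def build_non_redundant(records):
    """Collapse identical sequences, keeping every identifier that maps to them.

    records: iterable of (identifier, sequence) pairs.
    Returns a dict mapping each distinct sequence to its list of identifiers.
    """
    ids_by_sequence = {}
    for identifier, sequence in records:
        ids_by_sequence.setdefault(sequence.upper(), []).append(identifier)
    return ids_by_sequence


records = [("P1", "MKTAYIAK"), ("P2", "MKTAYIAK"), ("P3", "MKTAY")]
print(build_non_redundant(records))
```

Note that the fragment P3 is kept as a separate entry: only exactly identical sequences are merged, which is why fragments and splice variants of one gene can still appear side by side.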

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value


This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB (Protein Data Bank)

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database (domain sequence database)

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
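The selection rule can be sketched in a few lines. The sketch assumes that the neighbour lists have already been retrieved (no Medline access is attempted here), and the paper identifiers are invented for illustration.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbours shared by more than two hand-selected papers.

    neighbours_by_paper: dict mapping each hand-selected paper to the list
    of its neighbouring papers. 'More than two' is read as at least 3.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each source paper only once
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


neighbours = {
    "paperA": [101, 102, 103],
    "paperB": [102, 103],
    "paperC": [103, 102, 109],
    "paperD": [103],
}
print(secondary_literature(neighbours))  # prints [102, 103]
```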

Seed Alignment


An alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animals, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence errors and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any cluster) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
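The last two steps of the procedure (choosing the longest cluster member and keeping unclustered proteins) can be sketched as below. The sketch assumes the clustering itself has already been done elsewhere, for example by CD-HIT, and only shows the selection step; the identifiers and sequences are invented.

```python
def pick_representatives(clusters, singletons):
    """Return the identifiers used for architecture queries and domain counts.

    clusters: list of clusters, each a list of (identifier, sequence) pairs
              from one species (clustering is assumed to be done already).
    singletons: list of (identifier, sequence) pairs outside any cluster.
    """
    representatives = []
    for members in clusters:
        # The longest member of each cluster represents it.
        identifier, _ = max(members, key=lambda member: len(member[1]))
        representatives.append(identifier)
    representatives.extend(identifier for identifier, _ in singletons)
    return representatives


clusters = [
    [("a", "MKT"), ("b", "MKTAA")],
    [("c", "MMLV"), ("d", "MML")],
]
singletons = [("e", "MKV")]
print(pick_representatives(clusters, singletons))  # prints ['b', 'c', 'e']
```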

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
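The last-common-ancestor step can be sketched as follows. It assumes each organism is given as a root-to-species lineage; the lineages below are heavily simplified and are not taken from the NCBI taxonomy files.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by all root-to-species lineages, or None."""
    if not lineages:
        return None
    ancestor = None
    # zip stops at the shortest lineage; walk down while all lineages agree.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            ancestor = ranks[0]
        else:
            break
    return ancestor


lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
print(last_common_ancestor(lineages))  # prints "Eukaryota"
```

With these three organisms, an architecture shared by all of them would be dated to Eukaryota; dropping the yeast lineage would move the point of origin down to Metazoa.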

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience; Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices, or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
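A reciprocal best match between two genomes can be illustrated with the sketch below. It assumes the best hit of every protein in the other genome has already been computed (for example from a similarity search); the protein names are made up, and this is not SMART's actual pipeline.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit.

    best_a_to_b: dict mapping each protein of genome A to its best hit in B.
    best_b_to_a: dict mapping each protein of genome B to its best hit in A.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


human_to_mouse = {"h1": "m1", "h2": "m2", "h3": "m2"}
mouse_to_human = {"m1": "h1", "m2": "h3"}
print(reciprocal_best_hits(human_to_mouse, mouse_to_human))
# prints [('h1', 'm1'), ('h3', 'm2')]
```

Here h2 also points at m2, but m2 prefers h3, so only h3 and m2 form a 1:1 pair.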

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for Mozilla web browser Click here for more info

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

The SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
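How such a query might be evaluated against one protein is sketched below. This is an assumption-laden illustration: it reads a name repeated N times as a request for at least N copies, supports only the AND operator, and treats a protein as a plain list of domain and feature names. It is not SMART's query engine.

```python
import re
from collections import Counter


def matches_query(query, features):
    """Case-insensitive AND query; a repeated name requires that many copies."""
    terms = re.split(r"\s+AND\s+", query.strip())
    required = Counter(term.lower() for term in terms)
    present = Counter(feature.lower() for feature in features)
    return all(present[name] >= copies for name, copies in required.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))  # True
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3"]))                # False
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"]))     # True
```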

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.


GENERAL INTRODUCTION


The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database, to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
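The ±15 residue tolerance can be illustrated with a small check. The exact scoring used in the benchmark is not described here, so the sketch simply assumes that each domain is given as a (start, end) pair and that both ends must fall within the tolerance; the residue numbers are invented.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if predicted start and end both lie within the tolerance.

    predicted, reference: (start, end) residue numbers of one domain.
    """
    return all(abs(p - r) <= tolerance for p, r in zip(predicted, reference))


print(boundaries_agree((3, 118), (1, 125)))   # True: off by 2 and 7 residues
print(boundaries_agree((3, 150), (1, 125)))   # False: end is off by 25 residues
```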

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
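The AutoChop decision described above can be sketched as a simple threshold check. This is only an illustration: the class and function names are hypothetical, the RMSD threshold is read as 6.0 Å, and the source says "etc.", so the real protocol applies further criteria that are not listed here.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Hypothetical record for boundaries inherited from one chopped chain."""
    ssap_score: float             # SSAP structural alignment score (0-100)
    sequence_identity: float      # percent sequence identity to the query
    rmsd: float                   # RMSD of the superposition, in angstroms
    longest_end_extension: int    # longest unmatched terminal extension, residues


def passes_autochop(c: InheritedChopping) -> bool:
    """Apply only the four thresholds quoted in the text.

    The published criteria list ends with "etc.", so a True result here is
    necessary but not sufficient for automatic chopping.
    """
    return (
        c.ssap_score >= 80
        and c.sequence_identity > 80
        and c.rmsd <= 6.0
        and c.longest_end_extension <= 10
    )


close_relative = InheritedChopping(92.5, 97.0, 1.2, 3)
borderline = InheritedChopping(92.5, 80.0, 1.2, 3)   # identity not above 80%
print(passes_autochop(close_relative), passes_autochop(borderline))
```

A chain that fails any one threshold would not be discarded; as the text notes, its best ChopClose result is passed on as supporting information for manual curation.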

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps with 'holding stages', in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
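The hit-selection rule for harvesting sequence relatives can be sketched as below. This is a minimal illustration with hypothetical names; it assumes the E-value cut-off is 0.01 and that "60% residue overlap" means the aligned region covers at least 60% of the CATH domain length, which the text does not spell out.

```python
def is_homologous_hit(e_value: float,
                      aligned_residues: int,
                      domain_length: int,
                      max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Return True if an HMM hit meets both thresholds quoted in the text.

    Overlap is computed relative to the CATH domain length (an assumption).
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


# Hypothetical hits against a 150-residue CATH domain:
# (E-value, number of aligned residues)
hits = [(1e-20, 140), (1e-20, 60), (0.5, 140)]
kept = [h for h in hits if is_homologous_hit(h[0], h[1], 150)]
print(kept)
```

The same shape of filter, with thresholds of 95% identity and 80% overlap, would apply to the annotation-transfer step in the final sentence of the paragraph.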

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily is measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
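The grouping rule for homologous families can be written as a small predicate. This is a sketch, not CATH's actual implementation: the function name is hypothetical, and "high structural similarity" is taken here to mean an SSAP score of at least 80, a threshold borrowed from the way these notes elsewhere describe clear homologues.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           high_ssap: float = 80.0) -> bool:
    """Two domains are grouped together if either condition holds:

    1. significant sequence similarity (>= 35% identity), or
    2. high structural similarity (assumed: SSAP >= high_ssap)
       together with some sequence similarity (>= 20% identity).
    """
    if seq_identity_pct >= 35:
        return True
    return ssap_score >= high_ssap and seq_identity_pct >= 20


print(same_homologous_family(40, 60))   # sequence alone is enough
print(same_homologous_family(25, 85))   # structure plus some sequence
print(same_homologous_family(25, 72))   # neither condition met
```

Pairs that fail both tests are not necessarily unrelated; as described elsewhere in these notes, more distant relationships are checked manually using functional and literature evidence.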

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods, in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes, in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
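The percentages quoted for the 443 'new' sequences can be checked with a few lines of arithmetic. The counts for clear homologues, probable homologues and analogues come from the text; the novel-fold count is not stated there and is inferred below by subtraction, which is an assumption of this sketch.

```python
total_new = 443

# Counts stated in the text.
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Inferred, not stated: whatever is left over is taken to be the novel folds.
counts["novel folds"] = total_new - sum(counts.values())

percentages = {label: round(100 * n / total_new) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} ({percentages[label]}%)")

# Structures assignable as homologues using structural data.
homologue_pct = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / total_new
)
print(f"homologues overall: {homologue_pct}%")
```

Under this reading the remainder is 35 structures, which rounds to the 8% of novel folds reported, and the two homologue categories together account for the 83% figure used later in the discussion of function assignment.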


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing through the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive, non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
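The distance rule above can be sketched in a few lines. This is only an illustration and not JAIL's own code: it assumes a cutoff of 4.5 Ångström, represents each chain as a hypothetical dictionary of residue id to atom coordinates, and reads the "at least 5 C-alpha atoms" requirement as "at least 5 residues on each side" (one C-alpha per residue).

```python
from math import dist

CUTOFF = 4.5  # Angstrom; assumed value of the distance threshold


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return residue ids of chain_a having at least one atom within
    `cutoff` of any atom of chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) coordinates.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    hits = []
    for res_id, atoms in chain_a.items():
        # the complete residue counts as soon as one atom is close enough
        if any(dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            hits.append(res_id)
    return hits


def is_interface(chain_a, chain_b, min_residues=5):
    """Assumption: each side needs at least `min_residues` residues
    (one C-alpha atom per residue)."""
    return (len(interface_residues(chain_a, chain_b)) >= min_residues
            and len(interface_residues(chain_b, chain_a)) >= min_residues)
```

A real implementation would read the coordinates from a PDB file and use a spatial index instead of the all-against-all comparison shown here.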

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The SCOP database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and on the desired level of diversity. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All chain interfaces that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a large number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database that merges structural and sequence information.

Database scheme

Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none.

Full-text search: by keyword.

SCOP text search: by keyword.

Consider only interfaces having the following location: intra, inter, or don't care.
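The search form accepts several kinds of identifier in one field, so a client has to decide which kind it was given. The following is a minimal sketch of such a decision, not the logic of the JAIL server: the patterns are assumptions based on the usual shapes of PDB ids, SCOP domain ids, dotted EC numbers (such as 3.1.1) and UniProt accession numbers, and `classify_query` is a hypothetical helper name.

```python
import re

# (label, pattern) pairs, tried in order; all patterns are assumptions
PATTERNS = [
    # SCOP domain id: "d" + 4-character PDB id + chain + domain character
    ("SCOP-Id", re.compile(r"^d[0-9][a-z0-9]{3}[a-z0-9._][a-z0-9_]$")),
    # PDB id: one digit followed by three alphanumeric characters
    ("PDB-Id", re.compile(r"^[0-9][A-Za-z0-9]{3}$")),
    # EC number: two to four dot-separated levels, "-" allowed as wildcard
    ("EC-Number", re.compile(r"^\d+(\.(\d+|-)){1,3}$")),
    # UniProt accession number (6 or 10 characters)
    ("Accession number", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
]


def classify_query(text):
    """Guess which search field a query string belongs to."""
    query = text.strip()
    for label, pattern in PATTERNS:
        if pattern.match(query):
            return label
    # anything else is treated as free text
    return "Protein name"
```

Checking the most specific pattern first keeps a seven-character SCOP id from being mistaken for anything else; everything that matches no pattern falls through to the protein-name search.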

MMDB

MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, and then to view interactively the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: having identified a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes [1][2][3][4][5].

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
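Several of the genome statistics listed above are simple ratios over the domain assignments. The sketch below shows how a few of them could be computed; it is not SUPERFAMILY's own code. The input layout (a dictionary of sequence lengths and a list of assignment tuples) and the function name `genome_statistics` are assumptions, and domains on one sequence are assumed not to overlap.

```python
def genome_statistics(sequence_lengths, assignments):
    """Compute a few per-genome statistics.

    sequence_lengths: dict sequence id -> length in residues
    assignments: list of (sequence id, superfamily id, start, end),
                 with 1-based inclusive coordinates (assumed layout)
    """
    n_sequences = len(sequence_lengths)
    assigned = {seq_id for seq_id, _, _, _ in assignments}
    # residues covered by domains; assumes domains do not overlap
    covered = sum(end - start + 1 for _, _, start, end in assignments)
    total_residues = sum(sequence_lengths.values())
    superfamilies = {sf for _, sf, _, _ in assignments}
    return {
        "sequences": n_sequences,
        "sequences_with_assignment": len(assigned),
        "percent_sequences_with_assignment":
            100.0 * len(assigned) / n_sequences,
        "percent_sequence_coverage": 100.0 * covered / total_residues,
        "domains_assigned": len(assignments),
        "superfamilies_assigned": len(superfamilies),
    }
```

For example, a genome of three proteins in which two carry domain assignments has an assignment percentage of about 66.7, whatever the number of domains on those two proteins.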



The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family-level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Pfam 27.0 (March 2013, 14,831 families)

Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.


You can find data in Pfam in various ways:

Sequence search: analyze your protein sequence for Pfam matches

View a Pfam family: view Pfam family annotation and alignments

View a clan: see groups of related families

View a sequence: look at the domain organisation of a protein sequence

View a structure: find the domains on a PDB structure

Keyword search: query Pfam by keywords

Jump to: enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain


A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
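The two gathering cutoffs can be read as a two-stage filter. The sketch below is one plausible reading and not Pfam's pipeline: it assumes that a sequence must reach the sequence cutoff and that, within such a sequence, only domain matches reaching the domain cutoff are kept. The `Hit` record and the function name are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_id: str
    sequence_score: float   # bit score of the whole sequence
    domain_scores: tuple    # bit scores of the individual domain matches


def apply_gathering_threshold(hits, sequence_ga, domain_ga):
    """Keep hits whose sequence score reaches sequence_ga and, within
    them, only the domain matches whose score reaches domain_ga."""
    kept = []
    for hit in hits:
        if hit.sequence_score < sequence_ga:
            continue  # the sequence as a whole fails the cutoff
        domains = tuple(s for s in hit.domain_scores if s >= domain_ga)
        if domains:
            kept.append(hit._replace(domain_scores=domains))
    return kept
```

With a sequence cutoff of 25 bits and a domain cutoff of 20 bits, a sequence scoring 40 bits with domain matches of 25 and 15 bits would stay in the full alignment with only its first domain.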

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
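The idea of turning an alignment into a position-specific scoring system can be illustrated with a much simpler model than a profile HMM. The sketch below builds log-odds scores per alignment column only; it ignores the insert and delete states and the transition probabilities that a real profile HMM (as built by HMMER) has, and it assumes a uniform background distribution and a pseudocount of 1. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_specific_scores(alignment, pseudocount=1.0):
    """Return one dict per alignment column: residue -> log2-odds score.

    alignment: list of equal-length aligned sequences ('-' for gaps).
    """
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    scores = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return scores


def score_sequence(scores, sequence):
    """Sum the column scores of an ungapped sequence of matching length."""
    return sum(column[res] for column, res in zip(scores, sequence))
```

Residues that are frequent in a column receive positive scores and rare ones negative scores, so a sequence resembling the alignment scores higher than an unrelated one.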


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.

Posterior probability


HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
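The trusted and noise cutoffs follow directly from the gathering threshold once the scores of all matches are known. A minimal sketch under that reading, with a hypothetical function name:

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Return (TC, NC) for a list of match bit scores.

    TC: lowest score among matches in the full alignment (>= GA)
    NC: highest score among matches not in the full alignment (< GA)
    Either value is None if the corresponding group is empty.
    """
    included = [s for s in bit_scores if s >= gathering_threshold]
    excluded = [s for s in bit_scores if s < gathering_threshold]
    trusted = min(included) if included else None
    noise = max(excluded) if excluded else None
    return trusted, noise
```

By construction the gathering threshold always lies between the noise cutoff and the trusted cutoff.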

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used so that the current mode (Normal or Genomic) is easy to identify.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
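The definition of the bits score as a base-2 log-likelihood ratio can be written down directly. The sketch below uses hypothetical function names; the second function assumes, for illustration only, that both likelihoods factorise into independent per-residue probabilities.

```python
import math


def bits_score(p_model, p_random):
    """log2 of the ratio: likelihood under the homology model divided
    by likelihood under the random model."""
    return math.log2(p_model / p_random)


def bits_score_from_residues(model_probs, random_probs):
    """Same score computed in log space from per-residue probabilities,
    which avoids underflow for long sequences (assumes independence)."""
    return sum(math.log2(m / r) for m, r in zip(model_probs, random_probs))
```

A score of 3 bits therefore means the sequence is 8 times more likely under the homology model than under the random model, and every additional bit doubles that ratio.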

BLAST (Basic Local Alignment Search Tool)

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled-coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains as the query in the same order (Additional domains are allowed)
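The difference between the two definitions is easiest to see in code. The following sketch treats a domain architecture as a list of domain names in N- to C-terminal order; the function names are hypothetical and this is an interpretation of the glossary text, not SMART's query engine.

```python
def same_domain_composition(query, protein):
    """True if the protein has at least one copy of each query domain,
    in any order."""
    return set(query) <= set(protein)


def same_domain_organisation(query, protein):
    """True if the query domains occur in the protein in the same order;
    additional domains may appear before, between or after them."""
    remaining = iter(protein)
    # `in` consumes the iterator, so each query domain must be found
    # after the position where the previous one was found
    return all(domain in remaining for domain in query)
```

A protein with architecture SH2, SH3, TyrKc has the same composition as the query SH3, SH2, TyrKc but not the same organisation, whereas PH, SH3, SH2, C1, TyrKc satisfies both definitions.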

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM (Hidden Markov model)

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability greater than 0.5 for protein models) (modified from the HMMer User's Guide).
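The consensus rule can be sketched as follows. The input format (one dictionary of emission probabilities per match state) and the function name are assumptions made for illustration; this is not code from the HMMer package.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus string.

    emissions: list of dicts, one per match state, mapping an amino
    acid letter to its emission probability (assumed input format).
    """
    letters = []
    for probs in emissions:
        # pick the highest-probability amino acid at this position
        residue, p = max(probs.items(), key=lambda item: item[1])
        # capital letter only for highly conserved positions
        letters.append(residue.upper() if p > threshold else residue.lower())
    return "".join(letters)
```

A position where leucine is most probable at 0.4 is written as a lowercase l, while a position dominated by lysine at 0.9 is written as a capital K.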

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value


This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
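The glossary describes the E-value as an expected count and the P-value as a probability. For BLAST-style statistics the two are commonly related by P = 1 - exp(-E), which follows from assuming that the number of chance hits is Poisson distributed. That relation is not stated in these notes and is used below as an assumption; the function names are hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """P = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e_value)


def e_value_from_p_value(p_value):
    """Inverse relation: E = -ln(1 - P)."""
    return -math.log1p(-p_value)
```

For small values the two numbers are nearly identical (an E-value of 0.001 corresponds to a P-value of about 0.001), whereas an E-value of 1 corresponds to a P-value of about 0.63.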

PDB (Protein Data Bank)

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork and Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database (domain sequence database)

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 this set was extended by prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic) domains, only extracellular domains, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
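The selection step of this procedure amounts to counting how many hand-selected papers point to each neighbouring paper. The sketch below covers only that counting step; retrieval of the 100 neighbours from Medline is not shown, the input is a hypothetical dictionary, and excluding the hand-selected papers themselves from the result is an assumption.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbouring papers linked from more than two originals.

    neighbours_by_paper: dict mapping each hand-selected paper id to the
    ids of its neighbouring papers (assumed to be retrieved beforehand).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        for paper in set(neighbours):          # count each original once
            if paper not in neighbours_by_paper:   # skip the originals
                counts[paper] += 1
    # "more than two" original papers means at least three
    return sorted(p for p, n in counts.items() if n >= min_sources)
```

With three hand-selected papers, only a neighbour shared by all three would be reported.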

Seed Alignment


Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton and Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1 cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange

2 cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually at values considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine: Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART: Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization: Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup: Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
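The redundancy-reduction procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input dictionaries and the idea that the per-species clusters arrive as plain lists of IDs are all hypothetical, and the clustering itself (done by CD-HIT in SMART) is not reimplemented here.

```python
def collapse_identical(proteins):
    """Keep one copy of each 100% identical sequence.

    proteins: dict mapping protein ID -> sequence (hypothetical input).
    Returns a dict mapping each distinct sequence to every ID that shares it,
    so the alternative IDs stay available.
    """
    merged = {}
    for prot_id, seq in proteins.items():
        merged.setdefault(seq, []).append(prot_id)
    return merged


def pick_representatives(clusters, lengths):
    """Choose the longest member of each cluster as its representative.

    clusters: list of lists of protein IDs for ONE species (assumed to come
              from an external clustering step at a 96% identity cutoff).
    lengths:  dict mapping protein ID -> sequence length.
    Single proteins (clusters of size one) represent themselves.
    """
    return [max(cluster, key=lambda pid: lengths[pid]) for cluster in clusters]


proteins = {"P1": "MKT", "P2": "MKT", "P3": "MKTAYIAK"}
print(collapse_identical(proteins))
print(pick_representatives([["a", "b"], ["c"]], {"a": 3, "b": 8, "c": 5}))
```

Only the IDs returned by `pick_representatives` would then be used for architecture queries and domain counts.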

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.
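The last-common-ancestor step described above can be illustrated with a short Python sketch. It assumes each organism is given as a root-to-leaf list of taxon names; the function name and the example lineages are hypothetical and are not taken from SMART or from NCBI's taxonomy files.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    lineages: list of lists of taxon names, each ordered from the root down.
    Returns None if the list is empty or the lineages share no common root.
    """
    lca = None
    # zip stops at the shortest lineage, so we never index past its end.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break  # lineages diverge at this rank
        lca = level[0]
    return lca


# Hypothetical lineages of organisms sharing one domain architecture.
organisms = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(organisms))
```

For these three lineages the shared prefix ends at Eukaryota, which under the definition above would be reported as the point of origin of the architecture.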

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (PubMed).

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows you to search for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows you to search for multiple copies of domains, and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
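To make the query behaviour concrete, here is a small Python sketch of a matcher for AND-only queries. It is not SMART's implementation: the function name is hypothetical, and it assumes that repeating a term N times means "at least N copies" and that only the AND operator is present.

```python
import re
from collections import Counter


def matches_query(query, features):
    """Check a protein's features against an AND-only selective query.

    query:    string such as "Sh3 AND sH3 AND sh3" (terms are case insensitive).
    features: list of domain/feature names found in one protein, with repeats.
    Assumption: a term repeated N times requires at least N copies.
    """
    terms = [t.strip().lower() for t in re.split(r"\s+AND\s+", query) if t.strip()]
    required = Counter(terms)
    present = Counter(f.lower() for f in features)
    return all(present[term] >= count for term, count in required.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"]))
```

Counting terms with `Counter` is what lets a single query express both "contains these domains" and "contains this domain several times".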

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order and composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world-wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
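The nine-level code and the inclusion criteria above can be expressed as a short Python sketch. Everything named here is an assumption made for illustration: the dot-separated layout of the code, the example identifier, and the function and parameter names are hypothetical and not taken from the CATH software.

```python
# The nine levels in the order they appear in a CATHsolid code.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code):
    """Split a dot-separated nine-field code into a level -> number mapping.

    Assumes codes look like "3.40.50.300.1.2.1.1.3" (hypothetical example).
    """
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def meets_inclusion_criteria(resolution_angstrom, n_residues, side_chain_fraction):
    """Apply the three criteria from the text: resolution of 4 A or better,
    at least 40 residues, and 70% or more side chains resolved."""
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and side_chain_fraction >= 0.70
    )


print(parse_cathsolid("3.40.50.300.1.2.1.1.3"))
print(meets_inclusion_criteria(2.1, 159, 0.95))
```

Two domains that share the first four fields of such a code sit in the same homologous superfamily; the remaining five fields locate them within its sequence clusters.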

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
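The AutoChop decision described above amounts to a conjunction of threshold tests, which the following Python sketch makes explicit. The class and function names are hypothetical, and only the four criteria listed in the text are checked; the text's "etc." indicates further criteria that are not given here and are therefore not modelled.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float             # SSAP structural alignment score
    sequence_identity: float      # percent identity with the query chain
    rmsd_angstrom: float          # RMSD of the superposition, in angstroms
    longest_end_extension: int    # residues, longest unaligned end


def can_autochop(result):
    """Return True if all four listed criteria are met (others are omitted)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd_angstrom <= 6.0
        and result.longest_end_extension <= 10
    )


close = InheritedChopping(92.5, 98.0, 1.2, 3)
borderline = InheritedChopping(85.0, 80.0, 2.0, 3)  # identity not above 80%
print(can_autochop(close), can_autochop(borderline))
```

A result that fails any test would not be discarded; as the text says, it would be passed on as supporting information for manual boundary assignment.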

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years, we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues in length.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v 2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
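The hit-filtering rule in this paragraph can be written as a small Python sketch. It is an illustration only: the function names are hypothetical, the thresholds are read as an E-value below 0.01 with at least 60% of the domain covered, and the overlap is computed here as aligned residues divided by domain length, which is an assumption about how the overlap was measured.

```python
def residue_overlap(match_start, match_end, domain_length):
    """Fraction of the CATH domain covered by a hit (1-based, inclusive)."""
    aligned = max(0, match_end - match_start + 1)
    return aligned / domain_length


def is_sequence_relative(e_value, overlap_fraction):
    """Harvesting rule: E-value below 0.01 and at least 60% residue overlap."""
    return e_value < 0.01 and overlap_fraction >= 0.60


def can_annotate(percent_identity, overlap_fraction):
    """Annotation-transfer rule: 95% identity with an 80% residue overlap."""
    return percent_identity >= 95 and overlap_fraction >= 0.80


overlap = residue_overlap(11, 130, 159)   # 120 of 159 residues aligned
print(round(overlap, 3))
print(is_sequence_relative(1e-5, overlap), can_annotate(97.0, overlap))
```

The example hit is accepted as a sequence relative but falls just short of the stricter overlap needed for annotation transfer, which shows why the two thresholds are kept separate.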

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
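As a rough illustration of how these SSAP score ranges might be applied, the sketch below maps a score to the broad category described in the text. The function name and the category labels are hypothetical, and the thresholds are guidelines only ("generally" above 80 or 70), not hard rules of the CATH protocol.

```python
def ssap_relationship(score: float) -> str:
    """Map an SSAP score (0-100) to the broad category it usually suggests."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural similarity"


for value in (100, 86.5, 74, 52):
    print(value, ssap_relationship(value))
```

A score in the 70 to 80 band only indicates a shared fold; as the text notes, without supporting sequence or functional similarity it may reflect convergence rather than common ancestry.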

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995 information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18,000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
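The four-level hierarchy can be made concrete with a small sketch that splits a dotted CATH identifier into its Class, Architecture, Topology and Homologous superfamily numbers. The dotted form `3.40.50.200` is the usual way CATH identifiers are written; the `CathId` type and helper names below are hypothetical and only illustrate the nesting of the levels.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """The four numeric levels of a CATH identifier."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_id(text: str) -> CathId:
    """Split an identifier such as '3.40.50.200' into its four levels."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {text!r}")
    return CathId(*(int(part) for part in parts))


def same_fold(a: CathId, b: CathId) -> bool:
    """Two domains share a fold group when their C, A and T levels agree."""
    return a[:3] == b[:3]


first = parse_cath_id("3.40.50.200")
second = parse_cath_id("3.40.50.300")
print(first, same_fold(first, second))
```

Because the levels are nested, comparing a prefix of the identifier is enough to ask whether two domains share a class, an architecture or a fold.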


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
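The breakdown of the 443 structures can be checked with a short calculation. The counts are taken from the text; the number of novel folds is not stated as a count there, so the sketch derives it as the remainder, on the assumption that the four categories are exhaustive. The category labels are just dictionary keys chosen for this example.

```python
total_new_sequences = 443

# Counts quoted in the text for structures resembling a known fold.
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Assumption: whatever is left over corresponds to the novel folds.
novel_folds = total_new_sequences - sum(counts.values())
counts["novel folds"] = novel_folds

percentages = {
    name: round(100 * count / total_new_sequences)
    for name, count in counts.items()
}
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"])
    / total_new_sequences
)

for name, count in counts.items():
    print(f"{name}: {count} ({percentages[name]}%)")
print(f"clear + probable homologues: {homologue_share}%")
```

The remainder works out to 35 structures, about 8% of the 443, and the two homologue categories together account for about 83%.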


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
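The comparison of EC identifiers used in this analysis can be illustrated with a small sketch. An EC number has four dot-separated levels, and "the first three EC identifiers were the same" is taken here to mean that the first three levels match across all members of a family. The function names and the example EC numbers are chosen for illustration only.

```python
from typing import Iterable, Tuple


def ec_prefix(ec_number: str, levels: int = 3) -> Tuple[str, ...]:
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def share_ec_prefix(ec_numbers: Iterable[str], levels: int = 3) -> bool:
    """True when every EC number in the family agrees on its first `levels` fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1


family = ["3.4.21.62", "3.4.21.64"]       # illustrative EC numbers
print(share_ec_prefix(family))             # agreement on the first three levels
print(share_ec_prefix(family, levels=4))   # full EC codes differ
```

Requiring agreement on all four levels corresponds to the stricter "single EC code" criterion quoted for the more closely related families.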

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL: Just Another Interface Library

Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall the data set consists of about 180,000 interfaces.

JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compiling of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
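The distance criterion above can be sketched as follows. This is a minimal illustration, not the JAIL implementation: it assumes the cut-off is 4.5 Å, represents each chain as a hypothetical dictionary from residue label to a list of atom coordinates, and does not enforce the minimum of 5 C-alpha atoms per interface side.

```python
import math
from typing import Dict, List, Tuple

Atom = Tuple[float, float, float]
Residues = Dict[str, List[Atom]]   # residue label -> atom coordinates in Angstrom


def interface_residues(chain_a: Residues, chain_b: Residues,
                       cutoff: float = 4.5) -> List[str]:
    """Residues of chain_a with at least one atom within `cutoff` of any atom of chain_b."""
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for label, atoms in chain_a.items():
        # The whole residue counts as soon as a single atom pair is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.append(label)
    return selected


chain_a = {
    "ALA1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],   # second atom is 3.5 A from SER10
    "GLY2": [(20.0, 0.0, 0.0)],                    # far from the partner chain
}
chain_b = {"SER10": [(5.0, 0.0, 0.0)]}
print(interface_residues(chain_a, chain_b))
```

The all-pairs comparison is quadratic in the number of atoms, which is fine for a sketch; a real calculation over whole PDB entries would normally use a spatial index or grid to limit the pairs tested.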

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None

Full-text search: keyword

SCOP text search: keyword

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

  Submit sequences for SCOP classification
  View domain organisation, sequence alignments and protein sequence details

For each genome you can:

  Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
  Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

  Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
  Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


See groups of related families

Look at the domain organisation of a protein sequence

Find the domains on a PDB structure

Query Pfam by keywords


Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.

Browse Pfam

You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
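The threshold terms in this glossary (gathering threshold, sequence score, trusted cutoff, noise cutoff) fit together roughly as in the minimal Python sketch below. It assumes the simple HMMER2-style rule that a sequence score is the sum of its domain scores (the glossary notes this is not quite true for HMMER3) and applies only the sequence cutoff. The `Hit` class, the function names and the scores are hypothetical illustrations, not part of Pfam or HMMER.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """One sequence matched by a profile HMM (hypothetical container)."""
    name: str
    domain_scores: List[float]  # bit score of each domain match

    @property
    def sequence_score(self) -> float:
        # Simplification: sequence score = sum of the domain scores.
        return sum(self.domain_scores)


def split_by_gathering_threshold(
    hits: List[Hit], ga_sequence: float
) -> Tuple[List[Hit], List[Hit]]:
    """Return (hits in the full alignment, hits left out of it)."""
    included = [h for h in hits if h.sequence_score >= ga_sequence]
    excluded = [h for h in hits if h.sequence_score < ga_sequence]
    return included, excluded


def trusted_and_noise_cutoffs(
    hits: List[Hit], ga_sequence: float
) -> Tuple[Optional[float], Optional[float]]:
    """TC = lowest score inside the full alignment; NC = highest score outside it."""
    included, excluded = split_by_gathering_threshold(hits, ga_sequence)
    tc = min((h.sequence_score for h in included), default=None)
    nc = max((h.sequence_score for h in excluded), default=None)
    return tc, nc


hits = [
    Hit("a", [30.0, 12.5]),
    Hit("b", [25.0]),
    Hit("c", [18.0]),
    Hit("d", [10.0, 5.0]),
]
tc, nc = trusted_and_noise_cutoffs(hits, ga_sequence=20.0)
```

With a gathering threshold of 20 bits, hits a and b enter the full alignment, so the trusted cutoff is the score of b and the noise cutoff is the score of c.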

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that in Genomic mode we are exploring a limited set of genomes.

Different color schemes are used to make it easy to identify which mode (Normal or Genomic) we are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
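The definition above can be written as a one-line calculation. The following is a minimal Python sketch; the function name and the two likelihood values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def bits_score(likelihood_homologue: float, likelihood_random: float) -> float:
    """Base-2 logarithm of the likelihood ratio (homologue model vs random model)."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homologue model
# than under the random model scores 3 bits.
score = bits_score(0.25, 0.03125)
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio, and a negative score means the random model explains the sequence better.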

BLAST Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query, in the same order (additional domains are allowed).
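The difference between the two query types can be sketched in Python as follows. This is one possible reading of the two definitions, not SMART's own code: composition is treated as a set-inclusion test and organisation as an ordered-subsequence test. The function names and domain lists are hypothetical.

```python
from typing import Sequence


def same_domain_composition(query: Sequence[str], target: Sequence[str]) -> bool:
    """True if the target has at least one copy of every domain in the query."""
    return set(query) <= set(target)


def same_domain_organisation(query: Sequence[str], target: Sequence[str]) -> bool:
    """True if the query domains occur in the target in the same order.

    Additional domains in the target are allowed anywhere.
    """
    remaining = iter(target)
    # `domain in remaining` advances the iterator, so order is enforced.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
with_extra_domain = ["SH3", "PH", "SH2", "TyrKc"]
shuffled = ["TyrKc", "SH2", "SH3"]
```

Here `with_extra_domain` satisfies both tests, while `shuffled` has the same composition as the query but a different organisation.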

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
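The rule for building a consensus line can be sketched in a few lines of Python. This is only an illustration of the rule as stated above; the per-position probability tables are invented, and a real profile HMM stores emission probabilities in its own file format rather than in Python dictionaries.

```python
from typing import Dict, List


def hmm_consensus(columns: List[Dict[str, float]], threshold: float = 0.5) -> str:
    """Build a one-line consensus from per-position amino acid probabilities.

    The most probable residue is shown at each position, in upper case when
    its probability exceeds the threshold and in lower case otherwise.
    """
    line = []
    for probabilities in columns:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        line.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(line)


# Three hypothetical match states of a protein model.
columns = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.9, "R": 0.1},
]
consensus = hmm_consensus(columns)
```

The first and third positions are highly conserved and appear in capitals; the second position is variable, so its most likely residue appears in lower case.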

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
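P-values and E-values (see the E-value entry) are closely related. The relation is not given in these notes, but under the Poisson assumption commonly used for BLAST statistics, the probability of seeing at least one chance hit is P = 1 - exp(-E). The Python sketch below only illustrates that relation; the function name is hypothetical and this is not how SMART itself reports its values.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """P(at least one chance hit) when chance hits are Poisson with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


small = p_value_from_e_value(1e-5)   # nearly identical to the E-value itself
large = p_value_from_e_value(10.0)   # approaches 1
```

For small values the two measures are almost interchangeable, which is why an E-value such as 1e-5 can be read as "about a 1 in 100,000 chance"; for E-values above 1 they diverge, because a probability cannot exceed 1.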

PDB protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
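The procedure above amounts to counting how many original papers list each neighbouring paper. The Python sketch below illustrates that counting step only. The neighbour lists are assumed to have been retrieved already (no Medline query is made), the paper identifiers are invented, and excluding the original papers from the result is an assumption of this sketch, not something the text states.

```python
from collections import Counter
from typing import Dict, List


def secondary_literature(
    neighbours_by_paper: Dict[str, List[str]],
    max_neighbours: int = 100,
    min_sources: int = 3,  # "more than two original papers"
) -> List[str]:
    """Return neighbouring papers shared by at least `min_sources` originals."""
    counts: Counter = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour once per original paper.
        counts.update(set(neighbours[:max_neighbours]))
    originals = set(neighbours_by_paper)  # assumption: originals are not re-listed
    return sorted(
        paper
        for paper, n_sources in counts.items()
        if n_sources >= min_sources and paper not in originals
    )


neighbours_by_paper = {
    "paper1": ["n1", "n2", "n3"],
    "paper2": ["n1", "n2"],
    "paper3": ["n1", "n4"],
}
selected = secondary_literature(neighbours_by_paper)
```

Only `n1` is a neighbour of three original papers, so it is the only entry selected; `n2` is shared by just two and is left out.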

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any cluster) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 29 million proteins), the redundancy should be minimal.
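Two steps of the redundancy procedure above, collapsing identical proteins and choosing the longest cluster member as representative, can be sketched in Python as follows. The clustering itself is assumed to have been done already (for example by CD-HIT) and is passed in as lists of IDs. The function names, protein IDs and sequences are hypothetical and do not reflect SMART's actual code.

```python
from typing import Dict, List


def collapse_identical(proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """Group protein IDs by sequence, so one copy is kept but all IDs stay available."""
    ids_by_sequence: Dict[str, List[str]] = {}
    for protein_id, sequence in proteins.items():
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def pick_representatives(
    clusters: List[List[str]], proteins: Dict[str, str]
) -> List[str]:
    """Return the ID of the longest member of each cluster.

    A cluster with a single member simply represents itself.
    """
    return [max(cluster, key=lambda pid: len(proteins[pid])) for cluster in clusters]


proteins = {
    "P1": "MKTAYIAK",
    "P2": "MKTAYIAK",    # identical to P1
    "P3": "MKTAYIAKQR",  # longer relative of P1
    "P4": "MSL",
}
identical_groups = collapse_identical(proteins)
representatives = pick_representatives([["P1", "P3"], ["P4"]], proteins)
```

In this toy example P1 and P2 collapse into one entry, P3 represents the cluster it shares with P1, and P4 remains as a single protein.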

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
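The last-common-ancestor step can be illustrated with a short Python sketch. It assumes each organism is described by its lineage written from the root downwards; the lineages below are simplified, hand-written examples rather than real NCBI taxonomy records, and the function name is hypothetical.

```python
from typing import List, Optional


def last_common_ancestor(lineages: List[List[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all lineages (root-first order)."""
    shared = None
    # zip(*lineages) walks all lineages level by level, stopping at the shortest.
    for taxa_at_level in zip(*lineages):
        if all(taxon == taxa_at_level[0] for taxon in taxa_at_level):
            shared = taxa_at_level[0]
        else:
            break
    return shared


# Organisms that all contain a protein with the same domain architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
origin = last_common_ancestor(lineages)
```

Because the architecture is found in both animals and fungi, its predicted point of origin in this example is Eukaryota.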

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details visit the change mode page

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing, and links to the relevant literature (PubMed).
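The idea behind such a check can be sketched in Python. How SMART actually stores its catalytic requirements is not described in these notes, so the representation used here (1-based domain positions mapped to the residues allowed at each) is purely an assumption for illustration, as are the function name and the example sequence.

```python
from typing import Dict, List, Tuple


def missing_catalytic_residues(
    domain_sequence: str, required: Dict[int, str]
) -> List[Tuple[int, str]]:
    """List (position, observed residue) for every unmet requirement.

    `required` maps a 1-based position in the domain to the residues that
    are acceptable there. "-" is reported if the domain is too short.
    """
    missing = []
    for position, allowed in sorted(required.items()):
        if 1 <= position <= len(domain_sequence):
            observed = domain_sequence[position - 1]
        else:
            observed = "-"
        if observed not in allowed:
            missing.append((position, observed))
    return missing


# Hypothetical requirements: D or E at position 3, H at 5, S at 6.
required = {3: "DE", 5: "H", 6: "S"}
problems = missing_catalytic_residues("MKDLHG", required)
is_active = not problems
```

In this example the serine required at position 6 is replaced by glycine, so the domain would be reported as inactive, together with the position that caused it.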

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern, standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
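When working with a long list of sequences, limits like these mean the list has to be split before submission. The Python sketch below only shows the splitting; it makes no request to SMART, the limits are the ones quoted above and may have changed since, and the function name and identifiers are hypothetical.

```python
from typing import Iterator, List

MAX_PER_ACCESS = 500   # limit quoted in the notes for version 3.5
MAX_PER_DAY = 2000     # limit quoted in the notes for version 3.5


def batches(items: List[str], size: int = MAX_PER_ACCESS) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


sequence_ids = [f"seq{i}" for i in range(1200)]
todays_ids = sequence_ids[:MAX_PER_DAY]      # stay within the daily limit
batch_sizes = [len(batch) for batch in batches(todays_ids)]
```

For 1200 identifiers this gives two full batches of 500 and a final batch of 200.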

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).


Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and to use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using ApacheDBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes

As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
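The behaviour of such a query (terms joined by AND, case-insensitive, repeated terms meaning multiple copies) can be sketched in Python. This is an illustration of the behaviour described in these notes, not SMART's query parser; the function name and feature lists are hypothetical, and only the AND operator is handled.

```python
import re
from collections import Counter
from typing import Iterable


def matches_query(query: str, features: Iterable[str]) -> bool:
    """Check an 'A AND B AND A' style query against a protein's features.

    Matching is case-insensitive, and a term repeated n times requires at
    least n copies of that domain or intrinsic feature in the protein.
    """
    terms = re.split(r"\s+AND\s+", query.strip())
    wanted = Counter(term.lower() for term in terms)
    present = Counter(feature.lower() for feature in features)
    return all(present[term] >= copies for term, copies in wanted.items())


three_sh3 = ["SH3", "SH3", "SH2", "SH3"]
two_sh3 = ["SH3", "SH2", "SH3"]
receptor_kinase = ["SIGNAL", "TRANS", "TyrKc"]
```

With these lists, the query `Sh3 AND sH3 AND sh3` matches `three_sh3` but not `two_sh3`, and `SIGNAL AND TyrKc` matches `receptor_kinase`.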

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram, representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains.

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five ‘SOLID’ sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the ‘General Information’ section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
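
To make the nine-level scheme concrete, the following minimal Python sketch splits an identifier into its named levels and recovers the superfamily (C.A.T.H) prefix. It assumes, purely for illustration, that a CATHsolid code is written as nine dot-separated integers; the function names and the example code are hypothetical and are not taken from the CATH distribution.

```python
from typing import Dict

# The nine CATHSOLID levels in hierarchical order, as described in the text.
LEVELS = ["C", "A", "T", "H", "S", "O", "L", "I", "D"]


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dotted nine-field code into a level -> number mapping.

    Assumption: the nine levels are written as dot-separated integers,
    e.g. "3.40.50.200.1.2.3.4.5" (a made-up example).
    """
    fields = code.strip().split(".")
    if len(fields) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(fields)}")
    return {level: int(value) for level, value in zip(LEVELS, fields)}


def superfamily(code: str) -> str:
    """Return the first four fields (C.A.T.H), i.e. the homologous superfamily."""
    return ".".join(code.strip().split(".")[:4])
```

Two domains that share the first four fields belong to the same homologous superfamily; the remaining five fields only subdivide that superfamily by sequence identity, with the final D field acting as a counter.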

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 ‘difficult’ multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
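
The ±15 residue criterion used in the benchmark can be illustrated with a short Python sketch. It assumes continuous domains described by (start, end) residue numbers and listed in the same order in the predicted and manually validated sets; the names are hypothetical and this is not the benchmark code itself.

```python
from typing import List, Tuple

Boundary = Tuple[int, int]  # (start, end) residue numbers of one domain


def boundaries_agree(predicted: List[Boundary],
                     reference: List[Boundary],
                     tolerance: int = 15) -> bool:
    """True if every predicted start and end lies within `tolerance`
    residues of the corresponding manually assigned boundary.

    Assumption: both lists describe continuous domains in the same order.
    """
    if len(predicted) != len(reference):
        return False  # a different number of domains cannot agree
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )
```

For example, a prediction of domains 1-120 and 121-300 agrees with a manual assignment of 1-110 and 111-305, but not with 1-100 and 101-305, where the first boundary is 20 residues away.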

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
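
The decision between automatic chopping and manual curation can be sketched as a simple threshold check. The Python below encodes only the four thresholds quoted above; the text says "etc.", so the real protocol applies further criteria that are not modelled here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float            # SSAP structural similarity score (0-100)
    sequence_identity: float     # percent identity to the chopped chain
    rmsd: float                  # RMSD of the alignment, in angstroms
    longest_end_extension: int   # unaligned residues at a chain end


def can_autochop(result: Inheritance) -> bool:
    """Apply the four thresholds listed in the text.

    Assumption: all four must hold; any additional criteria used by the
    real ChopClose/AutoChop protocol are not represented.
    """
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)
```

A result that fails any one check would, in the workflow described above, be passed on as supporting information for manual boundary assignment rather than discarded.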

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
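
The two figures quoted for the validation set can be related to counts of correct and incorrect predictions. The Python sketch below assumes that "error rate" means the fraction of accepted pairs that are not homologous; the source does not define the term, and the counts used in the note that follows are invented for illustration only.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of genuine homologous pairs that the classifier recognises."""
    return true_pos / (true_pos + false_neg)


def error_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of accepted pairs that are not homologous.

    Assumption: this is one plausible reading of 'error rate'; the source
    does not state which definition was used.
    """
    return false_pos / (true_pos + false_pos)
```

With hypothetical counts of 13 580 recognised homologous pairs out of 14 000 and 500 wrongly accepted non-homologous pairs, `sensitivity(13580, 420)` gives 0.97 and `error_rate(13580, 500)` is about 0.036, which would match the reported pattern of 97% recognition at an error rate below 4%.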

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps together with ‘holding stages’ in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases [ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21), SWISSPROT (22)] are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
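
The hit-selection rule described above (an E-value cut-off combined with a minimum overlap) can be sketched in Python as follows. The record layout, the names and the way overlap is computed (aligned residues divided by domain length) are assumptions made for illustration; the thresholds are read from the text as an E-value below 0.01 and at least 60% overlap.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the CATH domain covered by the hit
    domain_length: int      # total length of the CATH domain


def is_homologue(hit: Hit, max_e: float = 0.01, min_overlap: float = 0.60) -> bool:
    """Accept a hit only if it is both significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e and overlap >= min_overlap


def select_homologues(hits: Iterable[Hit]) -> List[Hit]:
    """Keep the hits that pass both thresholds, preserving input order."""
    return [hit for hit in hits if is_homologue(hit)]
```

Requiring both conditions guards against short but highly significant matches that cover only a fragment of the domain, which would otherwise be harvested as full domain relatives.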

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core and, although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the ‘folds’ of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an ‘all versus all’ HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
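
As a rough guide to reading SSAP scores, the ranges quoted above can be turned into a small lookup. The text describes these values as what SSAP "generally" returns, so the Python sketch below treats them as hard cut-offs only for the sake of illustration; the function name and labels are hypothetical.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough interpretation given in the text.

    Assumption: the 'generally above 80' and 'generally above 70' ranges are
    used as strict thresholds, which is a simplification.
    """
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level)"
    return "no clear structural relationship"
```

A score in the 70 to 80 range on its own does not establish homology; as noted above, without sequence or functional similarity it may reflect convergent evolution.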

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a ‘structural genomics’ initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
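
The grouping rule for homologous families combines a sequence criterion with a structural one, and can be written as a short predicate. In the Python sketch below, the sequence thresholds (35% and 20% identity) come from the text, whereas the cut-off for "high structural similarity" is not given here; an SSAP score of 80 is assumed for illustration and exposed as a parameter. The function name is hypothetical.

```python
def same_homologous_family(seq_identity: float,
                           ssap_score: float,
                           high_ssap: float = 80.0) -> bool:
    """Return True if two domains meet either grouping criterion.

    seq_identity: pairwise sequence identity in percent.
    ssap_score:   structural similarity score (0-100).
    high_ssap:    assumed threshold for 'high structural similarity';
                  not specified in the text.
    """
    if seq_identity >= 35:
        return True  # significant sequence similarity alone is sufficient
    # otherwise require high structural similarity plus some sequence similarity
    return ssap_score >= high_ssap and seq_identity >= 20
```

Under these assumptions, a pair with 25% identity is grouped only if the structures are also highly similar, and a pair with 15% identity is never grouped at this level, however similar the structures.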

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold ‘band’ therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
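The breakdown of the 443 structures can be checked with a few lines of arithmetic. The sketch below is only an illustration: the three category counts are the ones quoted in the text, while the number of novel folds is not stated explicitly and is obtained here by subtraction, which is an assumption.

```python
# Counts quoted in the text for the 443 'new' sequences (Fig. 4b).
TOTAL = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: everything not in the three categories above is a novel fold.
counts["novel folds"] = TOTAL - sum(counts.values())


def as_percentages(category_counts, total):
    """Return each category's share of `total`, rounded to a whole percent."""
    return {name: round(100 * n / total) for name, n in category_counts.items()}


shares = as_percentages(counts, TOTAL)
# Share of structures assignable as homologues (clear plus probable).
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / TOTAL
)

for name, share in shares.items():
    print(f"{name}: {counts[name]} structures, about {share}%")
print(f"clear + probable homologues: about {homologue_share}%")
```

The combined homologue share computed this way is the figure used later when the text discusses how many structures with novel sequences can be assigned through structural data.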

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
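The comparison of "the first three EC identifiers" can be sketched as a prefix check on EC numbers. This is a minimal illustration, not the analysis used in the paper; the family names and EC numbers below are hypothetical examples chosen only to exercise the function.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.1.1.7'."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the list agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Hypothetical families, for illustration only.
families = {
    "family_a": ["3.1.1.1", "3.1.1.7"],   # same first three fields
    "family_b": ["1.1.1.1", "2.7.1.1"],   # different enzyme classes
}

consistent = {name: shares_ec_prefix(ecs) for name, ecs in families.items()}
for name, ok in consistent.items():
    print(name, "shares first three EC fields" if ok else "has mixed EC classes")
```

Passing `levels=4` would test for a single complete EC code, the stricter criterion mentioned for families with higher sequence identity.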

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just another interface library? Of course not.

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall the data set consists of about 180,000 interfaces.

JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, each part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
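The distance criterion can be sketched as follows, taking the contact distance to be 4.5 Å. This is a minimal illustration under stated assumptions, not JAIL's implementation: the data layout (a dict from residue id to a list of atom coordinates), the function name and the toy coordinates are all hypothetical, and "within" is read here as "less than or equal to".

```python
import math

CUTOFF = 4.5  # contact distance in Ångström, as read from the definition above


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return the residue ids of chain_a that have at least one atom
    within `cutoff` of any atom of chain_b.

    Each chain is a dict mapping a residue id to a list of (x, y, z) tuples.
    """
    atoms_b = [atom for atoms in chain_b.values() for atom in atoms]
    hits = []
    for res_id, atoms in chain_a.items():
        # The whole residue counts as soon as a single atom is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            hits.append(res_id)
    return hits


# Toy coordinates, for illustration only.
chain_a = {1: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], 2: [(20.0, 0.0, 0.0)]}
chain_b = {10: [(5.0, 0.0, 0.0)]}

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(side_a, side_b)
```

The sketch compares every atom pair, which is fine for a toy case but slow for real structures; it also does not apply the minimum of 5 C-alpha atoms per interface part or the phosphorus-based rule for nucleic acids.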

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

JAIL can be searched by:

PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)

The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit. Full-text and SCOP text searches by keyword are also provided, and results can be limited to interfaces having a given location (intra, inter, or don't care).

MMDB

MMDB contains experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

A structural unit

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function

Envelope coordinates

See Alignment coordinates

Family

A collection of related protein regions

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

A HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets

Motif

A short unit found outside globular domains

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment

Pfam-A

A HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when

multiple copies are present

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to a HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment
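The relationship between the sequence score, the gathering threshold (GA), the trusted cutoff (TC) and the noise cutoff (NC) can be sketched in a few lines. This is an illustration built only from the glossary definitions above, not Pfam's own code: the function names and the example bit scores are hypothetical, the sequence score uses the simple sum described for HMMER2, and the membership rule (sequence score at or above the sequence cutoff, with at least one domain at or above the domain cutoff) is an assumption about how the two GA values combine.

```python
def sequence_score(domain_scores):
    """Sequence score as the sum of the domain scores (HMMER2 convention)."""
    return sum(domain_scores)


def in_full_alignment(domain_scores, seq_ga, dom_ga):
    """Assumed membership rule using both GA cutoffs (see lead-in)."""
    passes_sequence = sequence_score(domain_scores) >= seq_ga
    passes_domain = any(score >= dom_ga for score in domain_scores)
    return passes_sequence and passes_domain


def trusted_and_noise_cutoffs(scores, ga):
    """TC is the lowest score in the full alignment (at or above GA);
    NC is the highest score left out of it (below GA)."""
    included = [s for s in scores if s >= ga]
    excluded = [s for s in scores if s < ga]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Hypothetical bit scores, for illustration only.
scores = [12.5, 18.0, 25.3, 31.0, 40.2]
tc, nc = trusted_and_noise_cutoffs(scores, ga=20.0)
print("TC:", tc, "NC:", nc)
print(in_full_alignment([10.0, 15.5], seq_ga=20.0, dom_ga=12.0))
```

By construction NC < GA <= TC, which is why the three values are usually quoted together for an entry.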

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember that you are exploring a limited set of genomes, though.

Different color schemes are used to easily identify the mode you are in.

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
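The definition amounts to a base-2 log-likelihood ratio. A minimal sketch, assuming the two likelihoods are already available as plain numbers (the function name and the example values are hypothetical):

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio, as in the definition above."""
    return math.log2(likelihood_homologue / likelihood_random)


# Hypothetical likelihoods: the homologue model is 8 times more likely.
score = bits_score(0.5, 0.0625)
print(score)
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio, and a negative score means the random model explains the sequence better than the homologue model.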

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
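The difference between the two criteria can be sketched with domain architectures written as ordered lists of domain names. This is only one reading of the definitions, not SMART's implementation: "same order, additional domains allowed" is interpreted here as the query being a subsequence of the protein's architecture, and the function names and example architectures are hypothetical.

```python
def same_composition(query, protein):
    """True if every query domain occurs at least once in the protein."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order;
    extra domains before, between or after them are allowed."""
    remaining = iter(protein)
    # `domain in remaining` advances the iterator up to the first match,
    # so each query domain must be found after the previous one.
    return all(domain in remaining for domain in query)


# Illustrative architectures, for demonstration only.
query = ["SH3", "SH2", "TyrKc"]
reordered = ["SH2", "SH3", "TyrKc"]
extended = ["PH", "SH3", "SH2", "C2", "TyrKc"]

print(same_composition(query, reordered), same_organisation(query, reordered))
print(same_composition(query, extended), same_organisation(query, extended))
```

Any protein that satisfies the organisation test also satisfies the composition test, but not the other way round, as the reordered example shows.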

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
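The consensus rule can be sketched as follows, taking the capitalisation threshold to be a probability of 0.5. This is a minimal illustration, not HMMer's code: the input layout (one dict of residue probabilities per match position), the function name and the example probabilities are hypothetical.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus: the most probable residue at each
    position, upper case if its probability exceeds `threshold`,
    lower case otherwise."""
    letters = []
    for probabilities in match_emissions:
        residue, prob = max(probabilities.items(), key=lambda item: item[1])
        letters.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(letters)


# Hypothetical emission probabilities for three match positions.
emissions = [
    {"A": 0.7, "G": 0.2, "S": 0.1},     # strongly conserved alanine
    {"L": 0.4, "I": 0.35, "V": 0.25},   # leucine is most likely, but weakly
    {"K": 0.9, "R": 0.1},               # strongly conserved lysine
]

consensus = hmm_consensus(emissions)
print(consensus)
```

Only the single most probable residue is reported per position, so the consensus line hides how close the runner-up residues are; the full emission distribution is needed for that.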

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication.

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm.

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
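The E-value and P-value entries of this glossary describe two views of the same chance expectation. As a minimal sketch, assuming chance hits follow a Poisson distribution (the usual assumption in BLAST statistics, not something stated in the SMART glossary itself), the probability of at least one chance hit is P = 1 - exp(-E). The function name `evalue_to_pvalue` is hypothetical.

```python
import math


def evalue_to_pvalue(evalue):
    """Probability of at least one chance hit, given the expected count E."""
    if evalue < 0:
        raise ValueError("E-value must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-evalue)


for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, evalue_to_pvalue(e))
```

For small values the two numbers are nearly identical, which is why they are often used interchangeably for significant hits; they diverge once E approaches 1.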

PDB protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few have also been found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
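The selection rule can be sketched as a simple counting step. This is an illustration only: the function name `secondary_literature`, the dictionary input and the toy paper IDs are hypothetical, and the retrieval of neighbours from Medline is assumed to have happened already.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbouring papers shared by more than two original papers.

    neighbours_by_paper maps each hand-selected paper ID to the list of its
    neighbouring paper IDs (up to 100 each in the procedure described above).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour at most once per original paper.
        counts.update(set(neighbours))
    # "More than two original papers" means at least three.
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


neighbours = {
    "A": ["x", "y", "z"],
    "B": ["x", "y"],
    "C": ["x", "z"],
    "D": ["y", "q"],
}
print(secondary_literature(neighbours))  # ['x', 'y']
```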

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange;

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 29 million proteins), the redundancy should be minimal.
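The redundancy-lowering procedure can be sketched as follows. This is a rough illustration, not the SMART implementation: `reduce_redundancy` is a hypothetical name, `cluster_fn` is a hypothetical callable standing in for a per-species CD-HIT run, and `toy_clusters` is only a placeholder that groups sequences by their first three residues so that the example runs without external tools.

```python
from collections import defaultdict


def reduce_redundancy(proteins, cluster_fn):
    """Return {representative_id: [all member IDs]}.

    proteins   -- iterable of (protein_id, species, sequence) tuples
    cluster_fn -- hypothetical callable: list of sequences -> list of clusters,
                  each cluster being a list of indices into that list
    """
    # 1. Collapse 100% identical sequences, keeping every ID.
    unique = {}  # sequence -> (species, [ids])
    for pid, species, seq in proteins:
        unique.setdefault(seq, (species, []))[1].append(pid)

    # 2. Separate the unique sequences by species.
    by_species = defaultdict(list)
    for seq, (species, ids) in unique.items():
        by_species[species].append((seq, ids))

    representatives = {}
    for entries in by_species.values():
        sequences = [seq for seq, _ in entries]
        # 3. Cluster each species separately.
        for cluster in cluster_fn(sequences):
            # 4. The longest member represents the cluster.
            best = max(cluster, key=lambda i: len(sequences[i]))
            members = [pid for i in cluster for pid in entries[i][1]]
            representatives[entries[best][1][0]] = members
    return representatives


def toy_clusters(sequences):
    """Placeholder clustering: group sequences sharing their first 3 residues."""
    groups = defaultdict(list)
    for i, seq in enumerate(sequences):
        groups[seq[:3]].append(i)
    return list(groups.values())


proteins = [
    ("P1", "human", "MKTAYIAK"),
    ("P2", "human", "MKTAYIAK"),    # identical to P1
    ("P3", "human", "MKTAYIAKQR"),  # longer relative of P1
    ("P4", "yeast", "MSLQ"),
]
print(reduce_redundancy(proteins, toy_clusters))
# {'P3': ['P1', 'P2', 'P3'], 'P4': ['P4']}
```

Proteins that end up alone in a cluster are kept as their own representative, which corresponds to the "single proteins" in the last step of the list.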

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
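A minimal sketch of the last-common-ancestor step, assuming each organism is available as a root-to-leaf list of taxon names. The function name `last_common_ancestor` is hypothetical and the lineages are abbreviated for illustration; they are not complete NCBI taxonomy paths.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        raise ValueError("need at least one lineage")
    ancestor = None
    # Walk down from the root while every lineage agrees on the taxon.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor


human = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa",
         "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa",
       "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi",
         "Saccharomyces cerevisiae"]

print(last_common_ancestor([human, fly]))         # Metazoa
print(last_common_ancestor([human, fly, yeast]))  # Opisthokonta
```

Adding a single protein from a more distant organism moves the inferred point of origin towards the root, so the result depends on how complete the underlying protein database is.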

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The 'Evolution' section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard 'Domain selection' querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, 'TyrKc AND Pfam:Fz AND TRANS').

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. 'SIGNAL AND TyrKc'). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ('tax break') is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. 'Sh3 AND sH3 AND sh3'.
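The behaviour described here (AND-combined terms, case insensitivity, and repeated terms meaning multiple copies) can be sketched as follows. This is only an illustration of the matching logic: `matches_query` is a hypothetical function, it handles nothing beyond plain AND, and the real SMART query syntax may support more than is shown.

```python
from collections import Counter


def matches_query(query, protein_features):
    """True if the protein carries every AND-ed term, ignoring case, with at
    least as many copies of each domain as the query repeats it."""
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in protein_features)
    return all(present[name] >= copies for name, copies in wanted.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))  # True
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3"]))         # False
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"]))     # True
```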

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The numbers of new structures being deposited in the Protein Data Bank (PDB) continue to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains.

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. Firstly, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
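To make the nine levels concrete, the sketch below parses an identification code into named fields. It rests on assumptions not stated in the text: that the code is written as nine dot-separated integers in C.A.T.H.S.O.L.I.D order, and that S, O, L and I correspond to the 35, 60, 95 and 100% identity thresholds in that order. The names `CathSolid`, `parse_cathsolid` and `superfamily`, and the example code itself, are hypothetical.

```python
from typing import NamedTuple


class CathSolid(NamedTuple):
    c: int  # Class: secondary structure content
    a: int  # Architecture: gross orientation of secondary structures
    t: int  # Topology: fold group
    h: int  # Homologous superfamily
    s: int  # sequence cluster at 35% identity (assumed mapping)
    o: int  # sequence cluster at 60% identity (assumed mapping)
    l: int  # sequence cluster at 95% identity (assumed mapping)
    i: int  # sequence cluster at 100% identity (assumed mapping)
    d: int  # counter within the I-level


def parse_cathsolid(code):
    """Split a dotted nine-level code into its named levels."""
    parts = code.split(".")
    if len(parts) != 9:
        raise ValueError(f"expected 9 levels, got {len(parts)}: {code!r}")
    return CathSolid(*(int(p) for p in parts))


def superfamily(code):
    """Return the C.A.T.H prefix shared by all members of a superfamily."""
    return ".".join(str(n) for n in parse_cathsolid(code)[:4])


domain = parse_cathsolid("3.30.70.330.1.1.1.1.1")  # illustrative code only
print(domain.t, domain.d)                    # 70 1
print(superfamily("3.30.70.330.1.1.1.1.1"))  # 3.30.70.330
```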

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
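The boundary criterion used in this benchmark can be illustrated with a small check. This is a simplified sketch, assuming each domain is a single continuous (start, end) segment and that predicted and curated domains are listed in the same order; the function name `boundaries_agree` is hypothetical and the actual benchmark may score discontinuous domains differently.

```python
def boundaries_agree(predicted, curated, tolerance=15):
    """True if every predicted domain start and end lies within `tolerance`
    residues of the corresponding manually curated boundary."""
    if len(predicted) != len(curated):
        return False
    return all(
        abs(p_start - c_start) <= tolerance and abs(p_end - c_end) <= tolerance
        for (p_start, p_end), (c_start, c_end) in zip(predicted, curated)
    )


# Two-domain chain: all four boundaries are within 15 residues.
print(boundaries_agree([(1, 120), (121, 300)], [(5, 110), (115, 310)]))  # True
# The first start is 29 residues off, so the prediction is rejected.
print(boundaries_agree([(30, 120)], [(1, 120)]))                         # False
```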

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
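The decision between automatic chopping and manual curation can be sketched as a simple filter over the criteria quoted above. This is an assumption-laden illustration: the class `Inheritance` and function `can_autochop` are hypothetical, only the four named criteria are checked (the text ends its list with "etc.", so the real protocol applies more), and the RMSD cutoff is read as 6.0 Å.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Result of inheriting domain boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # in Å
    longest_end_extension: int  # residues


def can_autochop(result):
    """Check the four criteria named in the text (the full list is longer)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0  # assumed reading of the garbled cutoff
        and result.longest_end_extension <= 10
    )


close_relative = Inheritance(ssap_score=85.2, sequence_identity=92.0,
                             rmsd=2.1, longest_end_extension=4)
borderline = Inheritance(ssap_score=85.2, sequence_identity=80.0,
                         rmsd=2.1, longest_end_extension=4)

print(can_autochop(close_relative))  # True: boundaries are applied automatically
print(can_autochop(borderline))      # False: passed on as support for manual curation
```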

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and by using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
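A figure such as "97% recognised at an error rate of <4%" comes from sweeping a score threshold over a validation set. The sketch below shows one way such a number could be computed. It assumes that "error rate" means the fraction of non-homologous pairs scoring at or above the threshold, which the text does not state, and the function name and toy scores are purely illustrative.

```python
def sensitivity_at_error_rate(pos_scores, neg_scores, max_error_rate):
    """Find the most permissive threshold whose error rate stays within the limit.

    pos_scores: classifier scores for known homologous pairs
    neg_scores: classifier scores for known non-homologous pairs
    Returns (threshold, sensitivity), or None if no threshold qualifies.
    """
    best = None
    # Walk thresholds from strict to permissive; the error rate can only grow.
    for threshold in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        error_rate = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        if error_rate > max_error_rate:
            break
        sensitivity = sum(s >= threshold for s in pos_scores) / len(pos_scores)
        best = (threshold, sensitivity)
    return best


# Toy scores, not data from the CATH validation set.
homologous = [0.9, 0.8, 0.7, 0.4]
non_homologous = [0.6, 0.3, 0.2, 0.1]
print(sensitivity_at_error_rate(homologous, non_homologous, 0.0))
print(sensitivity_at_error_rate(homologous, non_homologous, 0.25))
```

Relaxing the permitted error rate lowers the usable threshold and so raises sensitivity, which is the trade-off the quoted figures summarise.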


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and to provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000 folds). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information on, and links to, other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
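The hit-selection rule above amounts to a two-condition filter. The following is a minimal sketch under stated assumptions: the record type and field names are hypothetical, overlap is taken to mean aligned residues divided by the length of the CATH domain, and the E-value cut-off is read as 0.01 from the source text.

```python
from typing import NamedTuple


class HmmHit(NamedTuple):
    """One HMM match between a UniProt sequence and a CATH domain (hypothetical record)."""
    sequence_id: str
    evalue: float
    aligned_residues: int   # residues of the domain covered by the match
    domain_length: int      # total residues in the CATH domain


def is_homologous_hit(hit: HmmHit, max_evalue: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Keep a hit only if it is both significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


# Made-up hits for illustration only.
hits = [
    HmmHit("seq_a", 1e-20, 140, 159),   # significant and well covered
    HmmHit("seq_b", 1e-20, 50, 159),    # significant but covers under 60%
    HmmHit("seq_c", 0.5, 150, 159),     # well covered but not significant
]
kept = [h.sequence_id for h in hits if is_homologous_hit(h)]
print(kept)
```

The stricter annotation rule in the same paragraph (95% identity, 80% overlap) could reuse the same shape with an identity field added.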

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
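The grouping rule above can be read as a small decision procedure. The sketch below combines it with the SSAP scores quoted elsewhere in these notes (above 80 for homologues, above 70 for fold-level similarity). It is an illustration only: the function name and return labels are hypothetical, treating "high structural similarity" as SSAP ≥80 is an assumption, and the real classification also draws on functional evidence and manual validation.

```python
def classify_pair(seq_identity_pct: float, ssap_score: float) -> str:
    """Assign a pair of domains to the closest CATH relationship the scores support."""
    # Significant sequence similarity alone is enough for a homologous family.
    if seq_identity_pct >= 35:
        return "homologous family"
    # Otherwise require high structural similarity plus some sequence similarity.
    if ssap_score >= 80 and seq_identity_pct >= 20:
        return "homologous family"
    # Structural similarity without sequence support only indicates a shared fold.
    if ssap_score >= 70:
        return "same fold"
    return "no relationship at these thresholds"


print(classify_pair(42.0, 65.0))
print(classify_pair(24.0, 86.0))
print(classify_pair(12.0, 75.0))
print(classify_pair(12.0, 50.0))
```

Keeping the sequence-only test first mirrors the text, where ≥35% identity is sufficient on its own.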

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found that only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
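The counts and percentages quoted for Figure 4 can be cross-checked with a few lines of arithmetic. In the sketch below the three category counts and the totals come from the text; the number of novel folds is not stated explicitly and is derived here as the remainder, so treat it as an inference.

```python
total_new_domains = 2159   # domains classified June 1997 to March 1998
new_sequences = 443        # domains not clearly homologous by sequence

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not quoted: whatever is left over must be the novel folds.
counts["novel folds"] = new_sequences - sum(counts.values())

percentages = {name: round(100 * n / new_sequences) for name, n in counts.items()}
for name, n in counts.items():
    print(f"{name}: {n} ({percentages[name]}%)")

# Share of all new domains that were clearly homologous by sequence.
sequence_homologue_pct = round(100 * (total_new_domains - new_sequences) / total_new_domains)
print(f"clearly homologous by sequence: {sequence_homologue_pct}%")
```

The rounded shares reproduce the quoted 45%, 38%, 9% and 8%, and the first two together give the 83% of novel sequences recognised as homologues through structure.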


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
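Comparing "the first three EC identifiers" means comparing EC numbers after dropping the final, most specific field. A minimal sketch of that comparison follows; the function names are hypothetical, and the EC strings are illustrative inputs rather than a statement about any particular CATH family.

```python
def ec_prefix(ec_number: str, levels: int = 3) -> tuple:
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec_prefix(ec_numbers, levels: int = 3) -> bool:
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


same_reaction_type = ["3.4.21.62", "3.4.21.4"]   # differ only in the fourth field
different_class = ["3.4.21.62", "1.1.1.1"]       # differ from the first field

print(family_shares_ec_prefix(same_reaction_type))
print(family_shares_ec_prefix(different_class))
print(family_shares_ec_prefix(same_reaction_type, levels=4))
```

Setting `levels=4` gives the stricter "single EC code" test used for the 95% and 98% figures.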

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form allows comprehensive, non-redundant data sets to be compiled for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Å of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
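The distance criterion in this definition can be sketched with plain coordinate arithmetic. This is an illustration rather than the JAIL code: the input layout (a dictionary mapping a residue label to its atom coordinates) and the function name are assumptions, the cut-off is read as 4.5 Å, and the minimum of five C-alpha atoms and the nucleic acid rule are not modelled.

```python
from math import dist


def interface_residues(chain, partner, cutoff=4.5):
    """Return residues of `chain` with at least one atom within `cutoff` of `partner`.

    Both arguments map a residue label to a list of (x, y, z) atom coordinates
    in angstroms.
    """
    partner_atoms = [xyz for atoms in partner.values() for xyz in atoms]
    return {
        residue
        for residue, atoms in chain.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in partner_atoms)
    }


# Toy coordinates along the x axis, not taken from a real structure.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(4.0, 0.0, 0.0)], "B2": [(30.0, 0.0, 0.0)]}

print(interface_residues(chain_a, chain_b))
print(interface_residues(chain_b, chain_a))
```

The all-against-all loop is quadratic in the number of atoms; for whole structures a spatial index such as a grid or k-d tree would be the usual replacement.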

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

The JAIL search form accepts a PDB-Id (e.g. 1aay), a SCOP-Id (e.g. d1az0a_), an EC number (e.g. 3.1.1), an accession number (e.g. P03697) or a protein name (e.g. capsid protein). The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit. Full-text and SCOP text keyword searches are also available, and results can be limited to interfaces with a given location (intra, inter or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family, or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera (e.g. model 0045110).

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.



The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets.

Motif

A short unit found outside globular domains.

Noise cutoff (NC)

The bit score of the highest scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry, which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest scoring match in the full alignment.
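The relationship between the sequence score, the trusted cutoff and the noise cutoff can be illustrated with a small sketch. It is not Pfam's own code: the function names, the sequence identifiers and the scores are invented for the example, and it assumes the scores and the membership of the full alignment are already known.

```python
def sequence_score(domain_scores):
    """Sequence score = sum of the domain bit scores for one Pfam entry."""
    return sum(domain_scores)


def trusted_and_noise_cutoffs(scores, in_full_alignment):
    """Return (TC, NC) for one entry.

    scores: dict of sequence id -> sequence score in bits (hypothetical data)
    in_full_alignment: set of sequence ids that are in the full alignment
    """
    included = [s for sid, s in scores.items() if sid in in_full_alignment]
    excluded = [s for sid, s in scores.items() if sid not in in_full_alignment]
    tc = min(included) if included else None  # lowest score inside the full alignment
    nc = max(excluded) if excluded else None  # highest score outside it
    return tc, nc


# Hypothetical example: seqA has two domains, the others one each.
scores = {
    "seqA": sequence_score([30.5, 12.0]),
    "seqB": 27.1,
    "seqC": 18.4,
    "seqD": 9.9,
}
tc, nc = trusted_and_noise_cutoffs(scores, {"seqA", "seqB"})
print(tc, nc)
```

With these made-up numbers the trusted cutoff is the score of seqB and the noise cutoff is the score of seqC, so the manually chosen threshold for the entry would lie somewhere between the two.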

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.

The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used to easily identify the mode you are in.


Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
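The definition above amounts to a one-line formula. The following is a minimal sketch, assuming the two likelihoods are already available as plain numbers; the function name and the example values are illustrative only.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """log2 of the ratio between the homology model and the random model."""
    return math.log2(likelihood_homologue / likelihood_random)


# Illustrative values: the homology model explains the sequence
# 1024 times better than the random model, giving a score of 10 bits.
print(bits_score(2 ** -10, 2 ** -20))
```

Each additional bit therefore means the homology model is favoured by another factor of two, and a score of zero means the two models explain the sequence equally well.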

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
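The difference between domain composition and domain organisation can be made concrete with a short sketch. It is an interpretation of the two definitions above, not SMART's implementation: proteins are represented simply as lists of domain names in N- to C-terminal order, and "same order, additional domains allowed" is read as an ordered-subsequence test.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order.

    Extra domains may sit before, between or after them.
    """
    remaining = iter(protein)
    # 'domain in remaining' advances the iterator, so order is enforced.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
shuffled = ["SH2", "SH3", "PH", "TyrKc"]   # same domains, different order
extended = ["PH", "SH3", "SH2", "TyrKc"]   # same order, one extra domain

print(same_composition(query, shuffled), same_organisation(query, shuffled))
print(same_composition(query, extended), same_organisation(query, extended))
```

Every protein that matches by organisation also matches by composition, but not the other way round, which is why the composition query returns the larger set.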

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
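As an illustration of how such a consensus line could be derived, here is a minimal sketch. It assumes the match-state emission probabilities are available as one dictionary per position; the data layout, the function name and the numbers are invented for the example and are not taken from HMMer.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    match_emissions: list of dicts, amino acid letter -> probability.
    The most probable residue is shown at each position, in upper case
    only if its probability exceeds the threshold.
    """
    line = []
    for probabilities in match_emissions:
        residue, p = max(probabilities.items(), key=lambda item: item[1])
        line.append(residue.upper() if p > threshold else residue.lower())
    return "".join(line)


# Hypothetical three-position model.
emissions = [
    {"A": 0.70, "G": 0.20, "S": 0.10},   # strongly conserved alanine
    {"L": 0.40, "I": 0.35, "V": 0.25},   # leucine is best, but below 0.5
    {"K": 0.90, "R": 0.10},              # strongly conserved lysine
]
print(hmm_consensus(emissions))
```

The second position is printed in lower case because no single residue reaches the 0.5 threshold there, even though leucine is the most probable one.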

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM), and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
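The glossary does not say how P-values and E-values relate to each other. In standard BLAST statistics, the number of chance hits scoring at least X is modelled as Poisson-distributed, so the probability of seeing at least one such hit is P = 1 - exp(-E). The sketch below applies that general relation; it is an assumption that SMART's figures follow it exactly, and the function name is illustrative.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number E.

    Uses P = 1 - exp(-E); expm1 keeps the result accurate for tiny E.
    """
    return -math.expm1(-e_value)


for e in (1e-6, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two measures are nearly identical, while for large E-values the P-value saturates close to 1.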

PDB: protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
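The selection rule above is essentially a counting step, which can be sketched as follows. The sketch assumes the neighbour lists have already been retrieved (the Medline lookup itself is not shown), and all paper identifiers are made up.

```python
from collections import defaultdict


def secondary_literature(neighbours_by_paper, more_than=2):
    """Select neighbouring papers shared by more than `more_than` originals.

    neighbours_by_paper: dict of original paper id -> list of neighbour ids
    (in SMART's description, up to 100 neighbours per original paper).
    """
    sources = defaultdict(set)
    for original, neighbours in neighbours_by_paper.items():
        for neighbour in neighbours:
            sources[neighbour].add(original)
    return sorted(n for n, originals in sources.items() if len(originals) > more_than)


# Hypothetical neighbour lists for four hand-selected papers.
neighbours = {
    "P1": ["N1", "N2"],
    "P2": ["N1", "N3"],
    "P3": ["N1", "N2"],
    "P4": ["N2", "N3"],
}
print(secondary_literature(neighbours))
```

Here N1 and N2 are each neighbours of three original papers and are kept, while N3 is shared by only two and is dropped.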

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile), or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)

o each species' proteins are separated

o CD-HIT clustering with 96% identity cutoff is performed on each species separately

o longest member of each protein cluster is used as the representative

o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
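The redundancy-reduction procedure above can be sketched in a few lines. This is only an outline of the bookkeeping, not SMART's pipeline: the clustering itself is not performed here, and the `cluster_of` mapping is a hypothetical stand-in for the result of running CD-HIT on each species separately. Protein IDs and sequences are invented.

```python
from collections import defaultdict


def reduce_redundancy(proteins, cluster_of):
    """Pick representative proteins following the steps listed above.

    proteins: list of (protein_id, species, sequence) tuples
    cluster_of: dict of protein_id -> cluster label within its species
                (stand-in for CD-HIT output); missing ids are singletons
    Returns (sorted representative ids, aliases of identical sequences).
    """
    # Step 1: keep one copy of identical sequences, but remember the other IDs.
    kept, first_id, aliases = [], {}, defaultdict(list)
    for pid, species, seq in proteins:
        if seq in first_id:
            aliases[first_id[seq]].append(pid)
        else:
            first_id[seq] = pid
            kept.append((pid, species, seq))

    # Steps 2-3: group by species, then by cluster label within the species.
    groups = defaultdict(list)
    for pid, species, seq in kept:
        label = cluster_of.get(pid, ("singleton", pid))
        groups[(species, label)].append((pid, seq))

    # Step 4: the longest member of each group is its representative.
    reps = [max(members, key=lambda m: len(m[1]))[0] for members in groups.values()]
    return sorted(reps), dict(aliases)


proteins = [
    ("h1", "human", "MKTAYIAK"),
    ("h2", "human", "MKTAYIAK"),      # identical to h1
    ("h3", "human", "MKTAYIAKQR"),    # clusters with h1, but longer
    ("h4", "human", "GGSLLV"),        # singleton
    ("m1", "mouse", "MKTAYIAKQ"),     # other species, kept separately
]
cluster_of = {"h1": "c1", "h3": "c1", "m1": "c1"}
reps, aliases = reduce_redundancy(proteins, cluster_of)
print(reps, aliases)
```

Because clustering is done per species, the mouse protein stays in the result even though it is nearly identical to the human ones; only the representatives and singletons returned here would be used for architecture queries and domain counts.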

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
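The last-common-ancestor step can be illustrated with a minimal sketch. It assumes each organism is given as a root-to-leaf list of taxon names; the lineages below are heavily simplified for the example and are not the actual NCBI taxonomy paths, and the function name is illustrative.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by all root-to-leaf lineages, or None."""
    ancestor = None
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break  # the lineages diverge at this depth
        ancestor = taxa_at_depth[0]
    return ancestor


# Simplified lineages of organisms carrying the same domain architecture.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"]

print(last_common_ancestor([human, fly]))
print(last_common_ancestor([human, fly, yeast]))
```

Adding a single distantly related organism moves the inferred point of origin towards the root, so the prediction is only as reliable as the underlying domain assignments.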

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
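A check of this kind could look like the sketch below. The notes do not say how SMART stores its catalytic requirements, so the representation used here (1-based positions within the predicted domain mapped to the allowed amino acids) is an assumption, and the sequence and positions are invented.

```python
def check_catalytic_residues(domain_sequence, required):
    """Classify a predicted domain as 'Active' or 'Inactive'.

    required: dict of 1-based domain position -> set of allowed amino acids
    Returns the label and the requirements that are not satisfied.
    """
    missing = {
        position: allowed
        for position, allowed in required.items()
        if position > len(domain_sequence)
        or domain_sequence[position - 1] not in allowed
    }
    return ("Inactive" if missing else "Active"), missing


# Hypothetical requirements: His at 3, Asp at 4, Lys or Arg at 6.
required = {3: {"H"}, 4: {"D"}, 6: {"K", "R"}}

print(check_catalytic_residues("GSHDLKAEN", required))
print(check_catalytic_residues("GSHALKAEN", required))  # Asp replaced by Ala
```

The second call reports the domain as inactive and lists position 4 as the unmet requirement, which corresponds to the "which amino acids are missing" detail shown on the annotation page.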

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
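The idea of a 1:1 reciprocal best match can be shown with a short sketch. It assumes the best hit of every protein in the other genome has already been determined by a similarity search; the dictionaries and identifiers are invented, and this is not SMART's own procedure.

```python
def reciprocal_best_matches(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical best hits between two genomes.
human_to_mouse = {"hs1": "mm1", "hs2": "mm2", "hs3": "mm2"}
mouse_to_human = {"mm1": "hs1", "mm2": "hs2"}

print(reciprocal_best_matches(human_to_mouse, mouse_to_human))
```

The protein hs3 is left out because its best hit, mm2, points back to hs2 instead; this is the typical outcome for one member of a recently duplicated gene pair.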

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 33 to 34

Search structure based profiles using RPS-Blast

Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program, and it's great. You may experience some problems if you're using a clunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
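One way to read the query semantics described above (AND-combined names, case insensitivity, repeated names meaning multiple copies) is as a multiset comparison. The sketch below is an interpretation, not SMART's implementation: the function name and the flat list of feature names per protein are assumptions made for illustration.

```python
import re
from collections import Counter


def matches_architecture(query, protein_features):
    """True if the protein has at least as many copies of each queried name.

    `query` is a string such as "Sh3 AND sH3 AND sh3"; names are compared
    case-insensitively and a name repeated n times requires n copies.
    """
    terms = [t.strip().lower() for t in re.split(r"\s+AND\s+", query.strip())]
    required = Counter(terms)
    present = Counter(name.lower() for name in protein_features)
    return all(present[name] >= count for name, count in required.items())


three_sh3 = ["SH3", "SH3", "SH3", "TyrKc"]
two_sh3 = ["SH3", "SH3", "TyrKc"]
hit = matches_architecture("Sh3 AND sH3 AND sh3", three_sh3)
miss = matches_architecture("Sh3 AND sH3 AND sh3", two_sh3)
```

Intrinsic features fit the same scheme, so a query like SIGNAL AND TyrKc is just a two-term case.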

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world-wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS).

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains.

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
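The nine levels can be pictured as a nine-field identifier. The sketch below assumes a dot-separated numeric code, which is an assumption made for illustration; the first four fields reuse the C.A.T.H number 3.40.50.200 quoted elsewhere in these notes for the subtilisin family, while the five trailing SOLID fields are invented.

```python
from typing import NamedTuple


class CathSolidId(NamedTuple):
    """One field per level of the nine-level CATHSOLID hierarchy."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int
    s35: int        # cluster at 35% sequence identity
    o60: int        # cluster at 60% sequence identity
    l95: int        # cluster at 95% sequence identity
    i100: int       # cluster at 100% sequence identity
    d_counter: int  # counter that makes the code unique per domain

    def superfamily(self):
        """Return the four-level C.A.T.H prefix as a string."""
        return ".".join(str(v) for v in self[:4])


def parse_cathsolid(code):
    """Parse a dot-separated nine-level code (format assumed, see lead-in)."""
    parts = code.split(".")
    if len(parts) != 9:
        raise ValueError(f"expected 9 levels, got {len(parts)}")
    return CathSolidId(*(int(p) for p in parts))


# The trailing ".1.1.1.1.2" is hypothetical.
domain = parse_cathsolid("3.40.50.200.1.1.1.1.2")
```

Two domains sharing the first eight fields would be identical in sequence and differ only in the D counter.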

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
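The ±15 residue criterion used in the benchmark can be expressed as a simple comparison of cut positions. The sketch assumes each chain is described by a list of residue numbers at which it is cut into domains; that representation and the function name are choices made here, not taken from the CATHEDRAL paper.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if every predicted cut lies within `tolerance` residues
    of the corresponding manually assigned cut."""
    if len(predicted) != len(reference):
        return False  # a different number of domains cannot agree
    return all(
        abs(p - r) <= tolerance
        for p, r in zip(sorted(predicted), sorted(reference))
    )


# Toy three-domain chain: two cut positions.
manual = [120, 260]
close_prediction = boundaries_agree([131, 249], manual)
far_prediction = boundaries_agree([150, 249], manual)
```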

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov Models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
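The decision between automatic and manual chopping can be sketched as a threshold check over the four criteria that the text names explicitly. The class and function names are hypothetical, and the criteria hidden behind "etc." in the text are not modelled, so this is strictly an illustration of the logic rather than the ChopClose code.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Quality measures for boundaries inherited from one related chain."""
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # in Angstrom
    longest_end_extension: int  # residues


def route(result):
    """Return 'AutoChop' if all listed criteria hold, else 'manual'."""
    passes = (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )
    return "AutoChop" if passes else "manual"


good = route(InheritedChop(92.5, 97.0, 1.8, 3))
borderline = route(InheritedChop(85.0, 78.0, 2.5, 4))  # identity too low
```

A result routed to "manual" is not discarded: as described above, it is shown to the curator as supporting information.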

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation and also in the web-based resources used to aid manual validation have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases (ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21), SWISSPROT (22)) are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
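The acceptance rule for harvested sequences combines a significance cut-off with a coverage cut-off. A minimal sketch, assuming that overlap is measured as the fraction of the CATH domain's residues covered by the hit (the text does not spell out the exact definition, and the function name is made up):

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """Apply the two thresholds quoted above to one HMM hit."""
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


# Invented hits against a 150-residue domain.
accepted = is_homologous_hit(1e-5, 120, 150)          # 80% overlap
too_short = is_homologous_hit(1e-5, 60, 150)          # 40% overlap
not_significant = is_homologous_hit(0.5, 140, 150)    # weak E-value
```

The stricter annotation-transfer rule in the same paragraph (95% identity, 80% overlap) would be a second filter of the same shape with different thresholds.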

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include for example CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
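The score ranges quoted above can be turned into a rough lookup. The labels and the function name below are choices made for this sketch; the text itself says the ranges hold only "generally", so the output should be read as a hint and never as a classification decision.

```python
def interpret_ssap(score):
    """Map an SSAP score (0 to 100) to the rough range described above."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "typical of the same fold (Topology level)"
    return "below the usual fold-level range"


labels = [interpret_ssap(s) for s in (100, 86.4, 74.0, 55.2)]
```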

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
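The two alternative routes into a homologous family can be written as one boolean rule. The text does not quantify "high structural similarity" at this point, so the sketch borrows the SSAP score of 80 mentioned earlier for homologous proteins as a stand-in; that threshold, like the function name, is an assumption.

```python
def same_homologous_family(sequence_identity, ssap_score,
                           high_structure_cutoff=80.0):
    """Apply the two grouping criteria described above.

    sequence_identity is in percent; high_structure_cutoff is an
    assumed stand-in for 'high structural similarity'.
    """
    if sequence_identity >= 35:
        return True  # significant sequence similarity alone suffices
    return ssap_score >= high_structure_cutoff and sequence_identity >= 20


by_sequence = same_homologous_family(42.0, 65.0)
by_structure = same_homologous_family(23.0, 88.0)
neither = same_homologous_family(12.0, 91.0)
```

The third case shows why structure alone is not enough under this rule: without at least some sequence similarity the resemblance could be convergent.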

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops, 10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
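The breakdown of the 443 new-sequence structures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the novel-fold count is inferred here as the remainder, and the classification function is a deliberately simplified, hypothetical reading of the stated criteria (it assumes the structure has already been matched to a known fold).

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
TOTAL_NEW = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
# Assumption: novel folds are whatever remains after the three groups above.
counts["novel fold"] = TOTAL_NEW - sum(counts.values())


def percentages(group_counts, total):
    """Express each group as a rounded percentage of the total."""
    return {name: round(100 * n / total) for name, n in group_counts.items()}


def classify_relationship(ssap_score, seq_identity_pct, has_supporting_evidence):
    """Simplified reading of the criteria for a structure matching a known fold.

    has_supporting_evidence stands for functional similarity and/or a
    significant score from a remote-homology sequence search.
    """
    if ssap_score >= 80 and seq_identity_pct >= 20:
        return "clear homologue"
    if has_supporting_evidence:
        return "probable homologue"
    return "analogue"


print(counts["novel fold"])
print(percentages(counts, TOTAL_NEW))
print(classify_relationship(85, 25, False))
```

The percentages reproduce the figures in the text (45%, 38%, 9% and 8%), and the two homologue groups together account for 368 of 443 structures, about 83%.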


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
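The comparison of EC identifiers within a family can be sketched as follows. This is a minimal illustration, assuming EC numbers are written as four dot-separated fields; the function names are hypothetical, and the two example EC numbers (trypsin and chymotrypsin) are chosen for illustration rather than taken from the CATH analysis.

```python
def shared_ec_levels(ec_numbers):
    """Count how many leading EC fields are identical across all entries."""
    split = [ec.split(".") for ec in ec_numbers]
    depth = 0
    for fields in zip(*split):
        if len(set(fields)) != 1:
            break
        depth += 1
    return depth


def has_single_ec_code(ec_numbers):
    """True when every member of the family carries the same full EC number."""
    return len(set(ec_numbers)) == 1


# Illustrative family: trypsin (3.4.21.4) and chymotrypsin (3.4.21.1).
family = ["3.4.21.4", "3.4.21.1"]
print(shared_ec_levels(family))
print(has_single_ec_code(family))
```

A family in which the first three fields agree corresponds to closely related reactions acting on different substrates, which is the situation described for most homologous enzyme families.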

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. New PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
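The distance criterion can be sketched in a few lines. This is a minimal illustration rather than the JAIL implementation: it assumes the cutoff is 4.5 Å, represents each chain as a hypothetical mapping from residue identifier to a list of atom coordinates, and treats "at least 5 C-alpha atoms" as "at least 5 residues" (one C-alpha per residue).

```python
import math

CUTOFF = 4.5       # Angstrom; assumed value of the distance criterion
MIN_RESIDUES = 5   # assumed: one C-alpha atom per interface residue


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return residues of chain_a with any atom within cutoff of chain_b.

    Each chain maps a residue id to a list of (x, y, z) atom coordinates.
    """
    atoms_b = [atom for atoms in chain_b.values() for atom in atoms]
    selected = set()
    for res_id, atoms in chain_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            selected.add(res_id)
    return selected


def is_valid_interface_part(residues, minimum=MIN_RESIDUES):
    """Apply the minimum-size rule to one side of an interface."""
    return len(residues) >= minimum


# Toy coordinates, invented for illustration only.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {1: [(3.0, 0.0, 0.0)]}
part = interface_residues(chain_a, chain_b)
print(part, is_valid_interface_part(part))
```

The all-against-all distance loop is quadratic in the number of atoms; for whole-PDB calculations a spatial index would normally replace it, but the selection rule itself stays the same.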

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id (e.g. 1aay)

SCOP-Id (e.g. d1az0a_)

EC-Number (e.g. 311)

Accession number (e.g. P03697)

Protein name (e.g. capsid protein)

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none.

A full-text keyword search and a SCOP text keyword search are also available.

Results can be restricted to interfaces having the following location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale, with 10 being the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit score of the lowest scoring match in the full alignment.
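The two score definitions above can be written out directly. This is a minimal sketch that follows the glossary wording literally (sequence score as the sum of domain scores, trusted cutoff as the lowest score in the full alignment); the function names and the bit scores are invented for illustration and do not come from Pfam or HMMER.

```python
def sequence_score(domain_scores):
    """Sequence score as defined above: the sum of the domain bit scores."""
    return sum(domain_scores)


def trusted_cutoff(full_alignment_scores):
    """Trusted cutoff: the lowest bit score among full-alignment matches."""
    return min(full_alignment_scores)


# Invented bit scores for three sequences matched to one entry.
domains_per_sequence = {
    "seqA": [30.5, 12.0],   # two domains
    "seqB": [25.0],         # single domain: sequence score equals domain score
    "seqC": [61.0],
}
scores = {name: sequence_score(d) for name, d in domains_per_sequence.items()}
print(scores)
print(trusted_cutoff(scores.values()))
```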

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Features

You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.


The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.

Different color schemes are used to make it easy to identify the current mode (Normal or Genomic).

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
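The definition amounts to a base-2 log-likelihood ratio. The sketch below assumes the two likelihoods are already available as plain probabilities; the function name and the example values are invented for illustration and are not output from HMMer or BLAST.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio of the two models."""
    return math.log2(likelihood_homologue / likelihood_random)


# Invented likelihoods: the homology model explains the sequence
# 1024 times better than the random model.
print(bits_score(2 ** -10, 2 ** -20))
```

Each additional bit corresponds to a doubling of the likelihood ratio, and a negative score means the random model explains the sequence better than the homology model.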

BLAST: Basic Local Alignment Search Tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4]). Coiled-coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query, in the same order (additional domains are allowed).
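The difference between the two definitions can be shown with a short sketch. It assumes that a protein's architecture is available as an ordered list of domain names; the function names are hypothetical, and the example architectures are invented for illustration rather than retrieved from SMART.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order.

    Additional domains may be interspersed, so this is a subsequence test.
    """
    remaining = iter(protein)
    # 'domain in remaining' consumes the iterator up to the first match,
    # which enforces left-to-right order.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
with_extra_domain = ["SH3", "SH2", "PH", "TyrKc"]
reordered = ["SH2", "SH3", "TyrKc"]

print(same_composition(query, reordered), same_organisation(query, reordered))
print(same_organisation(query, with_extra_domain))
```

Every protein that satisfies the organisation test also satisfies the composition test, but not the other way round, as the reordered example shows.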

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
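As a rough illustration of this definition, the expected number of chance hits can be obtained by scaling a per-comparison probability, or an empirical count from random sequences, to the size of the database searched. The sketch below is a generic reading of the definition and is not SMART's HMM-based calculation; the function names and numbers are invented.

```python
def e_value_from_p_value(p_value, database_size):
    """Expected chance hits: per-sequence probability times database size."""
    return p_value * database_size


def e_value_empirical(random_scores, x, database_size):
    """Estimate the E-value for score x from scores of random sequences.

    The fraction of random sequences scoring at least x is scaled to the
    size of the database that was actually searched.
    """
    hits = sum(1 for score in random_scores if score >= x)
    return hits * database_size / len(random_scores)


print(e_value_from_p_value(0.001, 2000))
print(e_value_empirical([5, 12, 7, 20, 3], x=10, database_size=10))
```

Because the E-value scales with database size, the same alignment score is less significant in a larger database.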

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.

HMM Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).

Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
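The rule described above can be sketched in a few lines. This is an illustration only: it assumes each match position is given as a dictionary of residue probabilities, which is a simplification and not HMMer's own model file format, and the function name is hypothetical.

```python
def hmm_consensus(match_probs, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    match_probs: list of dicts mapping a one-letter residue code to its
    probability at that position (an assumed, simplified representation).
    The most probable residue is shown at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    consensus = []
    for column in match_probs:
        residue, prob = max(column.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(consensus)


example_model = [
    {"A": 0.70, "G": 0.30},             # strongly conserved
    {"L": 0.40, "V": 0.35, "I": 0.25},  # weakly conserved
    {"K": 0.90, "R": 0.10},             # strongly conserved
]
```

For `example_model` the second position falls below the 0.5 cut-off, so it is written in lower case.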

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bit scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
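E-values and P-values are usually related through a Poisson assumption: if E chance hits are expected, the probability of seeing at least one is P = 1 - exp(-E). The sketch below shows that conversion. It is the standard relationship used for BLAST-style statistics, and applying it to any particular SMART score is an assumption of this example, not something the glossary states.

```python
import math


def pvalue_from_evalue(e_value):
    """P(at least one chance hit) given the expected number of chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def evalue_from_pvalue(p_value):
    """Inverse conversion; requires 0 <= p_value < 1."""
    # -log1p(-P) equals -ln(1 - P) and stays accurate for very small P.
    return -math.log1p(-p_value)
```

For small values the two measures are nearly identical (an E-value of 0.001 corresponds to a P-value of about 0.001), whereas a large E-value pushes P towards 1.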

PDB protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties representing a homologous family, which may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
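The selection rule above amounts to counting how many hand-selected papers point to each neighbour. A minimal sketch follows, assuming the neighbour lists have already been retrieved; the identifiers and the function name are hypothetical, and no Medline access is shown.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbours referenced from more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper to the list
    of its neighbouring papers. "More than two" original papers means at
    least three, hence the default min_sources=3.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each original paper only once
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


example_neighbours = {
    "paper1": ["a", "b", "c"],
    "paper2": ["a", "b"],
    "paper3": ["a", "c", "c"],  # a duplicate neighbour is counted once
}
```

In `example_neighbours` only neighbour "a" is reached from three original papers, so it alone would enter the secondary literature list.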

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 29 million proteins), the redundancy should be minimal.
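Two of the redundancy-reduction steps listed above are simple enough to sketch directly: collapsing identical sequences while keeping all their IDs, and choosing the longest member of each cluster as its representative. The clustering itself is done by CD-HIT in SMART and is not reimplemented here; the clusters are assumed to be given, and both function names are hypothetical.

```python
from collections import defaultdict


def collapse_identical(records):
    """Keep one copy of each identical sequence, remembering every ID.

    records: iterable of (protein_id, sequence) pairs.
    Returns a dict mapping each distinct sequence to its list of IDs.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in records:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters):
    """Choose the longest sequence of each cluster as its representative.

    clusters: list of non-empty lists of sequences, assumed to come from
    an external clustering step run separately for each species.
    """
    return [max(cluster, key=len) for cluster in clusters]
```

Proteins that belong to no cluster can be passed as single-member clusters, so they are kept unchanged.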

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
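The last-common-ancestor step can be illustrated as follows. This sketch assumes each organism is described by its lineage written from the root downwards, which is a simplification of how NCBI taxonomy data are actually stored; the function name and the example lineages are illustrative only.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf lists of taxon names, one per organism
    that contains at least one protein with the domain architecture.
    """
    if not lineages:
        return None
    common = None
    # zip stops at the shortest lineage, so no index can run out of range.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        common = ranks[0]
    return common


example_lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
```

For `example_lineages` the shared prefix ends at Eukaryota, which would be reported as the point of origin of the architecture.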

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices, or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the "search schnipsel and structures" checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for Mozilla web browser. Click here for more info.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this, and to use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
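The matching behaviour described for selective SMART (AND-combined terms, case-insensitive names, repeated terms meaning multiple copies) can be sketched as below. Only the AND operator is handled, the real query syntax may support more, and the function names are hypothetical rather than part of SMART.

```python
import re
from collections import Counter


def parse_query(query):
    """Split an AND-combined query into required copy counts per domain."""
    terms = re.split(r"\s+AND\s+", query.strip(), flags=re.IGNORECASE)
    # Names are upper-cased so that Sh3, sH3 and sh3 count as one domain.
    return Counter(term.strip().upper() for term in terms if term.strip())


def matches(query, domains):
    """True if the protein has at least the requested number of copies
    of every domain or intrinsic feature named in the query."""
    required = parse_query(query)
    present = Counter(domain.upper() for domain in domains)
    return all(present[name] >= copies for name, copies in required.items())
```

Under these assumptions, the query `Sh3 AND sH3 AND sh3` asks for at least three SH3 copies, so a protein with only two would not match.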

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected, or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
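The nine-level code can be pictured as one number per level. The sketch below assumes the identification code is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D. That textual layout, the function names and the example code are assumptions made for illustration and should be checked against the CATH website before use.

```python
CATHSOLID_LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code):
    """Split a dot-separated CATHSOLID code into its nine named levels."""
    parts = code.split(".")
    if len(parts) != len(CATHSOLID_LEVELS):
        raise ValueError(f"expected 9 levels, got {len(parts)}: {code!r}")
    return dict(zip(CATHSOLID_LEVELS, (int(part) for part in parts)))


def superfamily(code):
    """Return the C.A.T.H part, i.e. the homologous superfamily."""
    levels = parse_cathsolid(code)
    return ".".join(str(levels[name]) for name in "CATH")


example_code = "3.40.50.300.1.2.1.1.3"  # made-up code, for illustration only
```

Two domains sharing the first eight numbers would be 100% identical in sequence under this scheme, and only the final D counter would tell them apart.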

Table 1 CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
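The ±15 residue criterion can be expressed as a simple comparison between predicted and manually validated boundary positions. The sketch below assumes boundaries are given as residue numbers and that both assignments contain the same number of boundaries. The benchmark's exact scoring procedure is not given in the text, so this is an interpretation rather than the authors' method, and the function name is hypothetical.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if every predicted boundary lies within +/- tolerance residues
    of the corresponding reference boundary.

    predicted, reference: lists of boundary residue positions. They are
    paired in sorted order; differing counts are treated as disagreement.
    """
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)
```

For example, predicted boundaries at residues 118 and 240 would agree with manual boundaries at 110 and 252, whereas a boundary that is 20 residues off would not.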

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM-HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM-HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
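The hit-selection rule quoted above (E-value below 0.01 and a 60% residue overlap with the CATH domain) can be illustrated with a short filter. This is a sketch under assumptions: the `HmmHit` record and its field names are hypothetical, the E-value cutoff is read as 0.01, and "overlap" is taken to mean the fraction of the CATH domain's residues covered by the aligned region, which the text does not define precisely.

```python
from typing import List, NamedTuple


class HmmHit(NamedTuple):
    """Hypothetical summary of one sequence-vs-HMM match."""
    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the CATH domain covered by the hit
    domain_length: int      # total residues in the CATH domain


def is_homologous_hit(hit: HmmHit,
                      max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Apply the E-value and residue-overlap thresholds from the text."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def select_homologues(hits: List[HmmHit]) -> List[str]:
    """Return the identifiers of hits that pass both thresholds."""
    return [h.sequence_id for h in hits if is_homologous_hit(h)]


hits = [
    HmmHit("seqA", 1e-20, 140, 150),  # strong hit with good coverage
    HmmHit("seqB", 1e-20, 60, 150),   # significant but only 40% overlap
    HmmHit("seqC", 0.5, 145, 150),    # good coverage but not significant
]
print(select_homologues(hits))
```

The same pattern, with thresholds of 95% identity and 80% overlap, would describe the stricter rule the text gives for transferring functional annotations.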

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily is measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available (for example, CATH domain PDB files, sequences and DSSP files); they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
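The four-level hierarchy lends itself to a simple data structure. The sketch below assumes that a CATH identifier is written as four dot-separated integers (class, architecture, topology, homologous family), for example 3.40.50.200 for the subtilisin family mentioned in these notes; the `CathId` type and helper functions are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """Hypothetical container for the four CATH levels."""
    cls: int                # C: class
    architecture: int       # A: architecture
    topology: int           # T: topology (fold)
    homologous_family: int  # H: homologous family


def parse_cath_id(text: str) -> CathId:
    """Parse a dotted identifier such as '3.40.50.200'."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {text!r}")
    return CathId(*(int(p) for p in parts))


def same_fold(a: CathId, b: CathId) -> bool:
    """Two domains share a fold group if their C, A and T levels agree."""
    return a[:3] == b[:3]


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.300")   # illustrative identifier only
print(subtilisin, same_fold(subtilisin, other))
```

Comparing prefixes of the tuple reflects the hierarchical design: domains in the same homologous family necessarily share class, architecture and topology.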

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
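The grouping rule above can be written as a two-branch test. This is a minimal sketch, not the CATH implementation: the function name is hypothetical, and "high structural similarity" is represented by an SSAP score cutoff of 80, a value borrowed from the SSAP ≥80 criterion quoted elsewhere in these notes rather than stated in this sentence.

```python
def in_same_homologous_family(sequence_identity: float,
                              ssap_score: float,
                              ssap_cutoff: float = 80.0) -> bool:
    """Apply the two alternative criteria for a homologous family.

    sequence_identity is a percentage (0-100); ssap_score is 0-100.
    The ssap_cutoff default is an assumption (see lead-in).
    """
    # Criterion 1: significant sequence similarity on its own.
    if sequence_identity >= 35:
        return True
    # Criterion 2: high structural similarity plus some sequence similarity.
    return ssap_score >= ssap_cutoff and sequence_identity >= 20


print(in_same_homologous_family(40, 60))   # sequence similarity alone
print(in_same_homologous_family(25, 85))   # structure plus some sequence
print(in_same_homologous_family(15, 90))   # structure only: not enough
```

The third case corresponds to what the notes later call analogues or probable homologues, where additional functional or profile-based evidence is needed.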

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
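The breakdown above can be checked with a few lines of arithmetic. The counts of 443, 199, 169 and 40 are taken from the text; the number of novel folds is not stated as a count, so it is derived here as the remainder, which is an inference rather than a figure from the source.

```python
total_new_sequences = 443

# Counts quoted in the text.
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Inferred: whatever is left over must be the novel folds.
counts["novel folds"] = total_new_sequences - sum(counts.values())

# Percentages rounded to whole numbers, as they are quoted in the text.
percentages = {
    name: round(100 * n / total_new_sequences) for name, n in counts.items()
}

for name, n in counts.items():
    print(f"{name}: {n} ({percentages[name]}%)")
```

The remainder works out at 35 structures, or about 8%, matching the stated proportion of novel folds, and the clear plus probable homologues together give the 83% figure used later in these notes.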


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures are included in SCOP. New PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables comprehensive non-redundant data sets to be compiled for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Å of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
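The distance-based definition above can be sketched in plain Python. This is an illustration only, not JAIL's implementation: the data layout (a mapping from residue number to a list of atom coordinates) and the function names are assumptions, the cutoff is read as 4.5 Å, and the "at least 5 C-alpha atoms" rule is interpreted as requiring at least five interface residues, since each amino acid residue has one C-alpha atom. A brute-force all-pairs search is used for clarity; a real tool would use a spatial index.

```python
import math
from typing import Dict, List, Tuple

Atom = Tuple[float, float, float]   # x, y, z in Angstroms
Chain = Dict[int, List[Atom]]       # residue number -> atom coordinates


def interface_residues(chain: Chain, partner: Chain,
                       cutoff: float = 4.5) -> List[int]:
    """Return residues of `chain` with any atom within `cutoff` of `partner`."""
    partner_atoms = [atom for atoms in partner.values() for atom in atoms]
    selected = []
    for residue_id, atoms in chain.items():
        # The whole residue is included if any one of its atoms is close enough.
        if any(math.dist(a, p) <= cutoff for a in atoms for p in partner_atoms):
            selected.append(residue_id)
    return sorted(selected)


def is_valid_interface_part(residue_ids: List[int],
                            min_residues: int = 5) -> bool:
    """Interpretation of the 'at least 5 C-alpha atoms' rule (assumption)."""
    return len(residue_ids) >= min_residues


# Toy example: residues 1-6 lie along the x axis, 3 Angstroms apart;
# the partner chain runs parallel to them, 4 Angstroms away in z.
chain_a = {i: [(3.0 * i, 0.0, 0.0)] for i in range(1, 7)}
chain_b = {i: [(3.0 * i, 0.0, 4.0)] for i in range(1, 6)}

part_a = interface_residues(chain_a, chain_b)
print(part_a, is_valid_interface_part(part_a))
```

In the toy example, residues 1 to 5 sit directly opposite a partner atom at 4.0 Å and are selected, while residue 6 is 5.0 Å from its nearest partner atom and is excluded.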

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence-based clustering uses the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database that merges structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None

A full-text keyword search and a SCOP text keyword search are also offered.

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: having identified a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes [1][2][3][4][5].

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family, or species names, plus sequence, SCOP, PDB, or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments, and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks, and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that in Genomic mode you are exploring a limited set of genomes.

Different color schemes are used to make it easy to identify the current mode (Normal mode or Genomic mode).

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
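As a worked illustration of the definition above, the following minimal sketch computes a bits score from two likelihoods. The function name and the example numbers are hypothetical; real tools compute the likelihoods from their own scoring models.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio described in the text."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homologue model than
# under the random model scores 3 bits; equal likelihoods score 0 bits.
print(bits_score(8.0, 1.0))
print(bits_score(1.0, 1.0))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio.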

BLAST: Basic local alignment search tool

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.

Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.

Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.

Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.

Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
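The difference between the two definitions above can be sketched in a few lines. This is an illustrative reading, not SMART's implementation: a protein is represented as a list of domain names in sequence order, and both function names are hypothetical.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order;
    additional domains in between are allowed (a subsequence test)."""
    remaining = iter(protein)
    # `domain in remaining` advances the iterator past the match, so each
    # later query domain must be found further along the protein.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["TyrKc", "SH2", "SH3"]))        # True
print(same_organisation(query, ["TyrKc", "SH2", "SH3"]))       # False
print(same_organisation(query, ["SH3", "PH", "SH2", "TyrKc"]))  # True
```

Every protein that matches by organisation also matches by composition, but not the other way round.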

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected purely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with the number of alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are required by alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM: Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models). (Modified from the HMMer User's Guide.)
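The consensus rule just described can be sketched as follows. This is a simplified illustration rather than HMMer's code: the model is reduced to a list of per-position emission probabilities, and the function name and input layout are assumptions.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    `emissions` is a list with one dict per match position, mapping an
    amino acid letter to its probability (a hypothetical layout).
    The most probable residue is shown in upper case when its probability
    exceeds `threshold`, otherwise in lower case.
    """
    consensus = []
    for column in emissions:
        residue, probability = max(column.items(), key=lambda item: item[1])
        if probability > threshold:
            consensus.append(residue.upper())
        else:
            consensus.append(residue.lower())
    return "".join(consensus)


model = [
    {"A": 0.70, "G": 0.30},             # strongly conserved alanine
    {"L": 0.40, "V": 0.35, "I": 0.25},  # weakly conserved position
]
print(hmm_consensus(model))
```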

HMMer

The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication.

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm.

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB: non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame.

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42), 405-7).

P-value

This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
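The P-value and the E-value defined earlier are closely related. The sketch below uses the standard Poisson relationship from BLAST statistics, P = 1 - exp(-E); this formula is not stated in these notes and is added here only as an illustration, and the function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit scoring >= X, assuming the
    number of chance hits is Poisson-distributed with mean `e_value`."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (0.001, 0.1, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two numbers are nearly identical, which is why they are often used interchangeably when reporting significant hits; they diverge once E approaches 1.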

PDB: protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database: domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives information complementary to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
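The selection rule described above amounts to counting how many hand-selected papers point to each neighbouring paper. The following sketch assumes the neighbour lists have already been retrieved; the function name, the dictionary layout and the paper identifiers are all hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbouring papers reached from at least `min_sources`
    original papers ("more than two" in the text, hence the default 3).

    `neighbours_by_paper` maps each hand-selected paper id to the list of
    its neighbouring paper ids.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # A set ensures one original paper contributes at most one vote.
        counts.update(set(neighbours))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


neighbours = {
    "orig1": ["n1", "n2", "n3"],
    "orig2": ["n1", "n2"],
    "orig3": ["n1", "n4"],
}
print(secondary_literature(neighbours))
```

Only "n1" is reached from three original papers in this toy input, so it is the only entry in the resulting list.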

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
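The last two steps of the procedure above (choosing representatives and keeping unclustered proteins) can be sketched as follows. The clustering itself is not reproduced: the cluster membership is assumed to come from a tool such as CD-HIT, and the function name, identifiers and sequences are hypothetical.

```python
def pick_representatives(clusters, sequences):
    """Return the ids used for architecture queries and domain counts.

    `clusters` is a list of lists of protein ids (assumed to be the output
    of a per-species clustering run); `sequences` maps id -> sequence.
    The longest member represents each cluster, and proteins that belong
    to no cluster are kept as they are.
    """
    clustered = {pid for cluster in clusters for pid in cluster}
    representatives = [
        max(cluster, key=lambda pid: len(sequences[pid])) for cluster in clusters
    ]
    singles = [pid for pid in sequences if pid not in clustered]
    return representatives + singles


sequences = {"p1": "MKT", "p2": "MKTAYIA", "p3": "MKTAY", "p4": "GGSGG"}
clusters = [["p1", "p2", "p3"]]
print(pick_representatives(clusters, sequences))
```

Here "p2" is the longest member of the only cluster and "p4" is a single protein, so both are retained while the shorter fragments "p1" and "p3" are set aside.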

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
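The last-common-ancestor step described above can be illustrated with a small sketch. It assumes each organism is given as a root-first lineage of taxon names; the function name and the example lineages are simplified placeholders, not data taken from NCBI's taxonomy.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    Each lineage is a list of taxon names ordered from the root downwards.
    """
    ancestor = None
    # zip stops at the shortest lineage; walk down until the paths diverge.
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break
        ancestor = taxa_at_depth[0]
    return ancestor


lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))
```

With these three organisms the architecture would be dated to Eukaryota; dropping the fungal lineage would move the inferred origin down to Metazoa.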

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as "Inactive". The domain annotation page will show you the details on which amino acids are missing, and links to the relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure-based profiles using RPS-Blast

Clicking on the "search schnipsel and structures" checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42), 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with "Pfam:" (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 32 to 33

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images This means that you candownload the entire protein representation as a single image You have our permission to do thisand use these diagrams in any way that you like do acknowledge us though All domain bubbleshave been script-generated using The Gimp and itsPerl-Fu extension The script is GPLd and isavaliable for download

Fancy colour alignments All alignments we generate are coloured in using Leo Goodstadts excellent CHROMA programCHROMA is available from here and its great You may experience some problems if youre usinga klunky browser This will be fixed when you change your browser

Excellent transmembrane domains These are now calculated using the fine TMHMM2 program (the main web site is here ) kindlyprovided by Anders Krogh and co-workers You can read about the method here

Selective SMART We now store intrinsic features such as transmembrane domains and signal peptides in our database This means that these can be queried for (eg SIGNAL AND TyrKc) The feature namesare Signal peptide SIGNAL transmembrane domain TRANS coiled-coli COIL

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART now uses Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual: The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page: There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature: Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Start page: The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast: The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
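The query behaviour described here (AND-combined names, case-insensitive matching, repeated names meaning multiple copies) can be illustrated with a minimal Python sketch. The function name, the data layout and the example proteins are hypothetical; this is not SMART's own code, and it assumes that a name repeated N times requires at least N copies of that domain.

```python
from collections import Counter

def matches_query(query, domains):
    """Return True if a protein's domain list satisfies an AND-only query.

    Assumption: a name repeated N times in the query requires at least
    N copies of that domain; names are compared case-insensitively.
    """
    required = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(d.lower() for d in domains)
    return all(present[name] >= count for name, count in required.items())

# Hypothetical annotations: protein name -> domain/feature names.
proteins = {
    "protA": ["SH3", "SH3", "SH3", "TyrKc"],
    "protB": ["SIGNAL", "TyrKc"],
    "protC": ["SH3", "SH2"],
}

# Three SH3 terms: only proteins with at least three SH3 copies match.
hits = [name for name, doms in proteins.items()
        if matches_query("Sh3 AND sH3 AND sh3", doms)]
print(hits)
```

Counting terms with a Counter keeps the multiplicity check to a single comparison per domain name.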

Annotation: You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database: SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output: SMART now produces only a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART: The SMART database is updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries: You can ask for proteins having the same domain order/composition as your query protein.

SMARTed genomes: You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster Pfam searches: The Pfam searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains

Default search method is now HMMer: The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the home page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages: As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among the structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for the years 1972–2005 were fitted to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info.

CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, that are 40 residues or longer, and that have 70% or more of their side chains resolved.
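To make the nine-level code and the inclusion thresholds concrete, here is a minimal Python sketch. It assumes the CATHSOLID code is written as nine dot-separated integers, in the same style as four-part CATH ids; the function names and the example code string (in particular its last five numbers) are made up for illustration.

```python
# Level names in hierarchy order; the last five are the SOLID levels.
LEVELS = ["Class", "Architecture", "Topology", "Homologous superfamily",
          "S", "O", "L", "I", "D"]

def parse_cathsolid(code):
    """Split a dot-separated nine-part code into a dict of named levels.

    Assumption: the code is nine integers joined by dots.
    """
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))

def meets_inclusion_criteria(resolution_angstrom, n_residues, frac_side_chains):
    """Apply the three thresholds quoted in the text:
    resolution of 4 A or better, 40 or more residues, 70% or more
    of side chains resolved (passed here as a fraction)."""
    return (resolution_angstrom <= 4.0
            and n_residues >= 40
            and frac_side_chains >= 0.70)

# Hypothetical code: the S, O, L, I, D values are invented.
example = parse_cathsolid("3.40.50.200.1.1.1.1.1")
print(example["Class"], example["Topology"], example["D"])
```

Keeping the level names in one ordered list means the parser and any later display code share a single definition of the hierarchy.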

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
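One way to read the ±15 residue criterion is sketched below in Python. This is only an illustration under stated assumptions: the text does not say exactly how boundaries were compared, so the sketch assumes each domain is a single (start, end) segment, that predicted and reference domains are paired in order, and that both ends must fall within the tolerance. The function name and the example coordinates are hypothetical.

```python
def boundaries_within_tolerance(predicted, reference, tolerance=15):
    """Compare two lists of (start, end) domain segments, paired in order.

    Returns True only if the lists have the same length and every start
    and end differs from its counterpart by at most `tolerance` residues.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )

# Hypothetical two-domain chain: manual boundaries vs. an automated prediction.
manual = [(1, 120), (121, 260)]
predicted = [(1, 112), (113, 260)]
print(boundaries_within_tolerance(predicted, manual))
```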

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
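The automatic-chopping decision can be sketched as a simple threshold check. The Python below is a hypothetical illustration, not the CATH implementation: the class and function names are invented, only the four criteria named in the text are encoded, and the "etc." in the text means the real protocol applies further criteria that are not listed here.

```python
from dataclasses import dataclass

@dataclass
class InheritedChopping:
    """Quality measures for boundaries inherited from one chopped chain."""
    ssap_score: float
    seq_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues

def qualifies_for_autochop(result):
    """Check the four thresholds quoted in the text.

    Note the asymmetry: identity must be strictly greater than 80%,
    while the SSAP score may equal 80.
    """
    return (result.ssap_score >= 80
            and result.seq_identity_pct > 80
            and result.rmsd_angstrom <= 6.0
            and result.longest_end_extension <= 10)

# Hypothetical ChopClose results for two query chains.
good = InheritedChopping(ssap_score=91.2, seq_identity_pct=96.0,
                         rmsd_angstrom=1.4, longest_end_extension=3)
borderline = InheritedChopping(ssap_score=85.0, seq_identity_pct=80.0,
                               rmsd_angstrom=2.0, longest_end_extension=3)

for r in (good, borderline):
    print("AutoChop" if qualifies_for_autochop(r) else "manual DomChop support")
```

A result that fails any single check falls through to manual curation, mirroring the text's description of ChopClose output being passed on as supporting information.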

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of the two parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
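The two filters described above can be written as small predicates. The following Python is a minimal sketch under stated assumptions: the overlap and identity figures are treated as minimum values ("at least 60%", "at least 95%", "at least 80%"), which the text does not state explicitly, and the function names and hit tuples are hypothetical.

```python
E_VALUE_CUTOFF = 0.01            # hits must score below this E-value
MIN_DOMAIN_OVERLAP = 0.60        # fraction of the CATH domain covered
MIN_ANNOTATION_IDENTITY = 95.0   # percent identity for annotation transfer
MIN_ANNOTATION_OVERLAP = 0.80    # fraction overlap for annotation transfer

def is_homologous_hit(e_value, domain_overlap):
    """HMM scan filter: E-value below cutoff and sufficient domain overlap."""
    return e_value < E_VALUE_CUTOFF and domain_overlap >= MIN_DOMAIN_OVERLAP

def can_transfer_annotation(identity_pct, overlap):
    """BLAST filter for inheriting functional annotation from a database hit."""
    return (identity_pct >= MIN_ANNOTATION_IDENTITY
            and overlap >= MIN_ANNOTATION_OVERLAP)

# Hypothetical HMM hits: (sequence id, E-value, overlap with the CATH domain).
hits = [("seqA", 1e-20, 0.95), ("seqB", 0.5, 0.90), ("seqC", 1e-5, 0.40)]
accepted = [name for name, e_value, overlap in hits
            if is_homologous_hit(e_value, overlap)]
print(accepted)
```

Here seqB fails on E-value and seqC fails on overlap, so only seqA would be harvested into the superfamily.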

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
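As a rough guide to reading SSAP scores, the tendencies quoted above can be turned into a small lookup. The Python below is only an illustrative sketch: the text describes general tendencies ("generally above 80", "generally above 70") rather than hard cut-offs, and the function name and return labels are hypothetical.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation in the text.

    The 80 and 70 thresholds are rules of thumb, not strict boundaries.
    """
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "fold-level similarity (may be convergent)"
    return "no clear structural relationship"

for s in (100, 86.5, 74.0, 55.0):
    print(s, "->", interpret_ssap(s))
```

A score in the 70 to 80 band should be combined with sequence or functional evidence before inferring homology, as the text cautions.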

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
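The two-route rule for grouping proteins into a homologous family can be sketched in Python as follows. The sequence identity thresholds come from the text; the numeric cut-off for "high structural similarity" is not given in this passage, so the sketch assumes an SSAP score of 80 purely for illustration and exposes it as a parameter. The function name is hypothetical.

```python
def same_homologous_family(seq_identity_pct, ssap_score, high_ssap=80.0):
    """Apply the either/or rule from the text.

    Route 1: significant sequence similarity (>= 35% identity).
    Route 2: high structural similarity plus some sequence similarity
             (>= 20% identity). `high_ssap` is an assumed cut-off.
    """
    if seq_identity_pct >= 35:
        return True
    return ssap_score >= high_ssap and seq_identity_pct >= 20

# Hypothetical pairs: (percent identity, SSAP score).
pairs = {"close": (48.0, 92.0), "remote": (23.0, 85.0), "unrelated": (12.0, 85.0)}
for label, (identity, ssap) in pairs.items():
    print(label, same_homologous_family(identity, ssap))
```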

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops/; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α−β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11), which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
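A CATH identifier encodes the hierarchy described above as dot-separated numbers, one per level (Class, Architecture, Topology, Homologous superfamily); the subtilisin family of Figure 2, for instance, is 3.40.50.200. The minimal Python sketch below splits such an identifier and compares two entries down to a chosen level. The names CathId, parse_cath_id and same_level are illustrative only and are not part of any CATH software; the class numbering follows the usual CATH convention rather than anything stated in the text.

```python
from typing import NamedTuple

# Usual CATH class numbering (an assumption here; the text names the classes
# but does not give their numeric codes).
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_id(text: str) -> CathId:
    """Split a dotted C.A.T.H identifier into its four integer levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def same_level(a: CathId, b: CathId, depth: int) -> bool:
    """True if a and b agree on the first `depth` levels (1=C, 2=A, 3=T, 4=H)."""
    return a[:depth] == b[:depth]


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.300")  # second identifier chosen for illustration
print(CLASS_NAMES[subtilisin.class_], same_level(subtilisin, other, 3))
```

Two entries that agree to depth 3 share a fold group (T-level) without necessarily belonging to the same homologous family (H-level), which is the distinction the release statistics above rely on.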

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14), which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues since, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
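The percentages in this breakdown can be checked directly from the counts quoted above. The short Python sketch below does the arithmetic; the number of novel folds is not given as a count in the text and is inferred here by subtraction, and the variable names are illustrative.

```python
# Counts quoted in the text for the structures classified June 1997 - March 1998.
total_new_domains = 2159
new_sequence_structures = 443

clear_homologues = 199
probable_homologues = 169
analogues = 40

# Not stated as a count in the text: inferred by subtraction.
novel_folds = new_sequence_structures - (
    clear_homologues + probable_homologues + analogues
)


def percent(count: int, total: int) -> int:
    """Percentage of `total`, rounded to the nearest whole number."""
    return round(100 * count / total)


breakdown = {
    "clear homologues": percent(clear_homologues, new_sequence_structures),
    "probable homologues": percent(probable_homologues, new_sequence_structures),
    "analogues": percent(analogues, new_sequence_structures),
    "novel folds": percent(novel_folds, new_sequence_structures),
}

# Share of the 443 that could be assigned as homologues (clear + probable).
homologue_share = percent(
    clear_homologues + probable_homologues, new_sequence_structures
)

# Share of all 2159 new domains that were already clear sequence homologues.
sequence_homologue_share = percent(
    total_new_domains - new_sequence_structures, total_new_domains
)

print(breakdown, homologue_share, sequence_homologue_share)
```

The clear and probable homologues together account for the share of novel sequences that the later discussion describes as assignable through structural data.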


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
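The test applied here, whether all members of a family share the first three EC identifiers, is simple to state in code. The following Python sketch is one way to express it, assuming EC numbers are supplied as dotted strings; the function names and the example EC numbers are illustrative and do not come from the CATH analysis itself.

```python
from typing import Iterable, Tuple


def ec_prefix(ec_number: str, levels: int = 3) -> Tuple[str, ...]:
    """Return the first `levels` fields of a dotted EC number."""
    return tuple(ec_number.strip().split(".")[:levels])


def conserved_to_level(ec_numbers: Iterable[str], levels: int = 3) -> bool:
    """True if every EC number in the family shares its first `levels` fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1


# Illustrative family: two serine endopeptidases differing only in the last field.
family = ["3.4.21.62", "3.4.21.64"]
print(conserved_to_level(family, levels=3), conserved_to_level(family, levels=4))
```

Setting levels to 4 corresponds to the stricter "single EC code" criterion used for the higher sequence-identity thresholds.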

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins in order to understand the process of protein-protein docking.

To overcome this problem, we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Angstrom of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
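The residue-level definition above can be sketched in a few lines of Python. This is a minimal illustration, not JAIL's implementation: each side of the interface is assumed to be a list of (residue id, list of atom coordinates) pairs, the cutoff is exposed as a parameter defaulting to 4.5 Angstrom (which is how the distance quoted above is read here), and the minimum-size rule is interpreted as requiring at least five residues, hence five C-alpha atoms, on each part. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Atom = Tuple[float, float, float]
Residue = Tuple[int, Sequence[Atom]]  # (residue id, atom coordinates)


def interface_residues(
    side_a: Sequence[Residue], side_b: Sequence[Residue], cutoff: float = 4.5
) -> List[int]:
    """Ids of residues in side_a with at least one atom within `cutoff`
    of any atom in side_b. The whole residue counts once any atom qualifies."""
    atoms_b = [atom for _, atoms in side_b for atom in atoms]
    return [
        res_id
        for res_id, atoms in side_a
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b)
    ]


def is_valid_protein_interface(
    part_a: Sequence[int], part_b: Sequence[int], min_residues: int = 5
) -> bool:
    """Assumed reading of the size rule: each part needs >= 5 residues,
    i.e. >= 5 C-alpha atoms (one per residue)."""
    return len(part_a) >= min_residues and len(part_b) >= min_residues


# Toy example: two parallel six-residue strands 4.0 Angstrom apart,
# one pseudo-atom per residue.
strand_a = [(i, [(3.8 * i, 0.0, 0.0)]) for i in range(6)]
strand_b = [(i, [(3.8 * i, 4.0, 0.0)]) for i in range(6)]

part_a = interface_residues(strand_a, strand_b)
part_b = interface_residues(strand_b, strand_a)
print(part_a, part_b, is_valid_protein_interface(part_a, part_b))
```

The all-against-all distance loop is quadratic in the number of atoms; for whole-PDB calculations a spatial grid or k-d tree would normally replace it.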

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

How are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none.

A full-text search and a SCOP text search are available by keyword.

The search can be restricted to interfaces having a given location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Signalling

Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.

Transport

Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.

Translation

The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.

Transcription

The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Coiled coils

Intimately associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
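The difference between the two definitions is easiest to see in code. In the minimal Python sketch below, a protein is represented simply as an ordered list of domain names; "domain composition" is read as set inclusion and "domain organisation" as an ordered-subsequence match. The function names and the example architectures are illustrative and are not taken from SMART.

```python
from typing import Sequence


def same_composition(query: Sequence[str], target: Sequence[str]) -> bool:
    """Target has at least one copy of every domain in the query (order ignored)."""
    return set(query) <= set(target)


def same_organisation(query: Sequence[str], target: Sequence[str]) -> bool:
    """Target contains all query domains in the same order; extra domains
    may appear before, between or after them."""
    remaining = iter(target)
    # `domain in remaining` consumes the iterator up to the first match,
    # so each query domain must be found after the previous one.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
reordered = ["SH2", "SH3", "PH", "TyrKc"]
extended = ["PH", "SH3", "SH2", "C1", "TyrKc"]

print(same_composition(query, reordered), same_organisation(query, reordered))
print(same_composition(query, extended), same_organisation(query, extended))
```

Every protein that matches the organisation test also matches the composition test, but not the other way round.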

Entrez

A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.

E-value

This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are required by alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.


HMM Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
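The consensus rule just described can be sketched as follows. This Python example assumes the model is available as a list of per-position probability tables (one dictionary of amino acid to probability per match position); that representation and the function name are assumptions made for illustration and do not reflect HMMer's file format or API.

```python
from typing import Dict, List


def hmm_consensus(
    match_probs: List[Dict[str, float]], threshold: float = 0.5
) -> str:
    """One-line consensus: the most probable residue at each position,
    upper case if its probability exceeds `threshold`, lower case otherwise."""
    letters = []
    for column in match_probs:
        residue, prob = max(column.items(), key=lambda item: item[1])
        letters.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(letters)


# Toy three-position model (probabilities invented for illustration).
model = [
    {"A": 0.70, "G": 0.30},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.90, "R": 0.10},
]
print(hmm_consensus(model))
```

Positions where no residue dominates still contribute a letter, but in lower case, so the case pattern alone gives a quick view of where the family is strongly conserved.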

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bit scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.

ORF

Open reading frame.

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value


This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
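E-values and P-values describe the same chance expectation on different scales. Under the Poisson assumption commonly used for BLAST-style statistics, the probability of seeing at least one chance hit with score at least X is P = 1 - exp(-E). The short Python sketch below converts between the two; it illustrates that standard relation only, and whether SMART itself converts values this way is not stated in the text. The function names are illustrative.

```python
import math


def p_from_e(e_value: float) -> float:
    """P(at least one chance hit) given the expected number of chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def e_from_p(p_value: float) -> float:
    """Inverse relation: expected number of chance hits for a given P-value."""
    # -log1p(-P) equals -ln(1 - P) and stays accurate for very small P.
    return -math.log1p(-p_value)


for e in (1e-5, 0.01, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small values the two are almost identical, which is why they are often used interchangeably for significant hits; they diverge once E approaches 1, where P saturates towards 1.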

PDB protein data bank

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives complementary information to the profile searches.

Searched domains

In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
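The selection rule in this procedure is a simple counting step. The Python sketch below assumes the Medline neighbour lists have already been retrieved and are supplied as a dictionary mapping each hand-selected paper to its neighbouring papers; the retrieval itself is out of scope here, and the function name and identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def secondary_literature(
    neighbours_by_paper: Dict[str, Iterable[str]], min_sources: int = 3
) -> List[str]:
    """Neighbouring papers linked from more than two original papers.

    `min_sources=3` encodes "more than two"; each original paper is counted
    at most once per neighbour, even if the neighbour is listed twice.
    """
    counts: Counter = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


# Toy input with made-up identifiers (real lists would hold 100 neighbours each).
neighbours = {
    "orig1": ["n1", "n2", "n4"],
    "orig2": ["n1", "n3"],
    "orig3": ["n1", "n2", "n2"],
}
print(secondary_literature(neighbours))
```

In the toy input only n1 is reached from three original papers, so it is the only one selected; n2 is reached from two and is left out.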

Seed Alignment


Alignment that contains only one of each pair of homologues that are represented in aCLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 02 (seethe related article)

SEG

A program of Wootton amp Federhen [1] that detects regions of the query sequence that have lowcompositional complexity [2]

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate aquery You can use either Uniprot or Ensembl sequence identifiers

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported, hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
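Two of the steps in the redundancy procedure above, collapsing identical sequences and choosing the longest member of each cluster, can be sketched as follows. This is a minimal illustration and not SMART's own code: the function names and input formats are hypothetical, and the clusters are assumed to have been produced beforehand by a per-species clustering run (CD-HIT in the text), which is not invoked here.

```python
def collapse_identical(proteins):
    """Keep one entry per identical sequence, remembering every ID.

    proteins is a list of (protein_id, sequence) pairs (hypothetical format).
    Returns a dict mapping each distinct sequence to the IDs that share it.
    """
    ids_by_sequence = {}
    for protein_id, sequence in proteins:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def pick_representatives(clusters, lengths):
    """Return the longest member of each cluster.

    clusters is a list of lists of protein IDs, assumed to come from a
    clustering step at 96% identity; a single protein is a cluster of one.
    lengths maps protein ID to sequence length.
    """
    return [max(cluster, key=lambda pid: lengths[pid]) for cluster in clusters]


proteins = [("P1", "MKV"), ("P2", "MKV"), ("P3", "MKVA")]
print(collapse_identical(proteins))
print(pick_representatives([["P1", "P3"], ["P4"]], {"P1": 3, "P3": 4, "P4": 10}))
```

Only the representatives returned by the second function would then be used for architecture queries and domain counts, as the list above describes.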

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
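The last-common-ancestor step can be illustrated with a short Python sketch. It assumes each organism is given as a root-to-leaf lineage of taxon names; the function name and the example lineages are illustrative only and are not taken from SMART or from the NCBI taxonomy files.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    lineages is a list of lists of taxon names, each ordered from the
    root of the taxonomy down to the organism (assumed input format).
    """
    ancestor = None
    # zip(*lineages) walks the lineages level by level, stopping at the
    # end of the shortest one.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor


lineages = [
    ["root", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["root", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["root", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
print(last_common_ancestor(lineages))
```

For these three organisms the lineages diverge below Eukaryota, so an architecture found in all of them would be dated to that node.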

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
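One way to read the query behaviour described above is sketched below in Python: terms are compared case-insensitively, and a term repeated three times is taken to require three copies of that domain. That reading, the function name and the feature lists are assumptions made for illustration; this is not SMART's actual query engine.

```python
import re
from collections import Counter


def matches_query(query, protein_features):
    """Check an AND-only query against a protein's domains and features.

    query is a string such as "Sh3 AND sH3 AND sh3"; protein_features is
    the list of domain/feature names found in one protein, with repeated
    domains listed once per copy (hypothetical input format).
    """
    terms = [t.strip().lower() for t in re.split(r"\s+AND\s+", query.strip())]
    required = Counter(terms)
    present = Counter(name.lower() for name in protein_features)
    # Every term must occur at least as often as it is repeated in the query.
    return all(present[term] >= copies for term, copies in required.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH3", "TyrKc"]))
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "TyrKc"]))
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))
```

Only AND is handled here; any other operators the real service may support are outside this sketch.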

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to come from structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
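The three inclusion criteria in the last sentence can be written as a simple filter. The following Python sketch is illustrative only: the class, its field names and the example values are hypothetical, and the handling of structures without a resolution value (such as NMR models) is not covered because the text does not describe it.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    resolution_angstrom: float            # lower values mean better resolution
    n_residues: int
    side_chain_fraction_resolved: float   # between 0.0 and 1.0


def meets_cath_criteria(structure):
    """Apply the three inclusion thresholds quoted in the text."""
    return (
        structure.resolution_angstrom <= 4.0
        and structure.n_residues >= 40
        and structure.side_chain_fraction_resolved >= 0.70
    )


print(meets_cath_criteria(Structure(2.1, 159, 0.95)))
print(meets_cath_criteria(Structure(4.5, 159, 0.95)))
```

A structure failing any one of the three thresholds would be left out under this reading.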

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
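The decision between automatic chopping and manual support can be sketched as a threshold check. The Python below covers only the four criteria that the text names explicitly; the text ends its list with "etc.", so the real protocol applies further checks that are not reproduced here. The function and parameter names are hypothetical.

```python
def can_autochop(ssap_score, sequence_identity_pct, rmsd_angstrom,
                 longest_end_extension):
    """Return True if an inherited chopping passes the four named criteria.

    Thresholds are the ones quoted in the text: SSAP score >= 80,
    sequence identity > 80%, RMSD <= 6.0 angstroms, and a longest end
    extension of at most 10 residues. Any additional criteria used by
    ChopClose itself are not modelled.
    """
    return (
        ssap_score >= 80
        and sequence_identity_pct > 80
        and rmsd_angstrom <= 6.0
        and longest_end_extension <= 10
    )


# A close relative passes; a more divergent one falls back to manual curation.
print(can_autochop(92.5, 97.0, 1.8, 3))
print(can_autochop(78.0, 85.0, 4.2, 3))
```

When the check fails, the inherited boundaries would still be shown to the curator as supporting information, as described above.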

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4-5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM-HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM-HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated Data on structural similarity andsuperfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14) The DHS also providesfunctional annotations of domains within each H-level (superfamily) in CATH v 251

For each superfamily, pairwise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside coordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pairwise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
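
The acceptance rule for HMM hits described above can be sketched in a few lines. This is only an illustration: the Hit record, its field names and the example values are hypothetical, and the default thresholds (E-value below 0.01, at least 60% of the domain covered) are our reading of the figures in the text, so they are exposed as parameters rather than fixed.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One HMM match of a sequence against a CATH domain model (hypothetical record)."""
    sequence_id: str
    evalue: float
    aligned_residues: int  # residues of the domain covered by the match
    domain_length: int     # total length of the CATH domain


def is_homologue(hit, max_evalue=0.01, min_overlap=0.60):
    """Accept a hit only if it passes both the E-value and the overlap threshold."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


# Invented example hits, for illustration only.
hits = [
    Hit("seqA", 1e-5, 90, 120),   # good E-value, 75% overlap
    Hit("seqB", 1e-5, 50, 120),   # good E-value, but only about 42% overlap
    Hit("seqC", 0.5, 110, 120),   # high overlap, but E-value too large
]
accepted = [h.sequence_id for h in hits if is_homologue(h)]
print(accepted)
```

Both conditions must hold at once, so a short but highly significant match (seqB) is rejected just as a long but insignificant one (seqC) is.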

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score < 80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
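
As a rough aid to reading SSAP scores, the ranges quoted above can be written as a small lookup. This is a sketch only: the function name and the wording of the labels are ours, and the text describes these ranges as general tendencies ("generally returns"), not hard cut-offs.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical proteins"
    if score > 80:
        return "probably homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural relationship"


for example_score in (100, 85, 75, 60):
    print(example_score, interpret_ssap(example_score))
```

A score in the 70 to 80 band does not by itself imply common ancestry; as the text notes, without supporting sequence or functional similarity it may reflect convergent evolution.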

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
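
The two routes into a homologous family can be expressed as a single predicate. This is a minimal sketch under stated assumptions: the function name is ours, and "high structural similarity" is taken to mean an SSAP score of at least 80, a cut-off borrowed from the discussion of clear homologues elsewhere in these notes rather than stated at this point.

```python
def same_homologous_family(seq_identity_pct, ssap_score, ssap_cutoff=80.0):
    """Return True if two domains meet either homology criterion from the text.

    Route 1: significant sequence similarity (at least 35% identity).
    Route 2: high structural similarity (assumed: SSAP >= ssap_cutoff)
             together with some sequence similarity (at least 20% identity).
    """
    if seq_identity_pct >= 35.0:
        return True
    return ssap_score >= ssap_cutoff and seq_identity_pct >= 20.0


print(same_homologous_family(40.0, 60.0))  # sequence alone is enough
print(same_homologous_family(25.0, 85.0))  # structure plus some sequence identity
print(same_homologous_family(25.0, 75.0))  # neither route is satisfied
```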

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14), which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
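
The breakdown of the 443 structures can be checked with a few lines of arithmetic. The three group counts are taken from the text; the number of novel folds is not stated there as a count and is derived here as the remainder, which is an inference on our part.

```python
total = 443  # structures with 'new' sequences (Fig. 4b)

# Counts quoted in the text.
groups = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Derived, not quoted: whatever is left must be the novel folds.
groups["novel folds"] = total - sum(groups.values())

# Percentages rounded to whole numbers, as in the text.
percent = {name: round(100 * count / total) for name, count in groups.items()}

for name, count in groups.items():
    print(f"{name}: {count} ({percent[name]}%)")
```

The remainder comes to 35 structures, which rounds to the 8% of novel folds quoted, and the clear and probable homologues together account for 45% + 38% = 83% of the new sequences.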

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three fields of the EC identifier were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
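
Comparing EC identifiers at a chosen depth is straightforward once they are written in the usual dotted form. The sketch below assumes that form (four fields separated by dots); the function names and the example EC numbers are ours and are used purely for illustration.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of a dotted EC number as a tuple."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec(ec_numbers, levels=3):
    """True if every EC number in the family agrees in its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Hypothetical family whose members differ only in the fourth field.
family = ["3.4.21.62", "3.4.21.14", "3.4.21.66"]
print(family_shares_ec(family, levels=3))  # same at the first three fields
print(family_shares_ec(family, levels=4))  # not a single full EC code
```

The two calls correspond to the two measures used in the text: agreement at the first three fields, and the stricter requirement of a single complete EC code.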

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 24013010), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways.

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing through the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
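
The distance criterion for protein chains can be sketched as follows. This is not JAIL's own code: the data layout (a dictionary from residue identifier to a list of atom coordinates), the function names and the toy coordinates are assumptions made for illustration, and the cut-off is read as 4.5 Å. Since every residue carries one C-alpha atom, the minimum-size rule is applied here as a minimum number of residues per side.

```python
import math

CUTOFF_ANGSTROM = 4.5  # distance cut-off, read from the text as 4.5 Angstrom
MIN_RESIDUES = 5       # at least 5 C-alpha atoms, i.e. 5 residues, per side


def interface_residues(side, partner, cutoff=CUTOFF_ANGSTROM):
    """Residues of `side` with at least one atom within `cutoff` of any partner atom.

    Both arguments map a residue identifier to a list of (x, y, z) atom coordinates.
    """
    partner_atoms = [atom for atoms in partner.values() for atom in atoms]
    return [
        res_id
        for res_id, atoms in side.items()
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms)
    ]


def is_large_enough(residues, min_residues=MIN_RESIDUES):
    """Apply the minimum-size rule to one side of a candidate interface."""
    return len(residues) >= min_residues


# Toy coordinates, invented for illustration.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(4.0, 0.0, 0.0)], "B2": [(30.0, 0.0, 0.0)]}

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(side_a, side_b, is_large_enough(side_a))
```

The whole residue is included as soon as a single atom satisfies the cut-off, which is why A1 qualifies through its second atom only. The all-against-all distance loop is adequate for a sketch; real structures would call for a spatial index.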

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

How are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
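
The effect of the identity setting can be illustrated with a simple greedy filter. To be clear about what this is: it is not the CD-HIT algorithm, only a simplified sketch in the same spirit, and it uses a naive position-by-position identity that assumes pre-aligned sequences of equal length. All names and the toy sequences are invented.

```python
def percent_identity(a, b):
    """Naive percent identity of two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def non_redundant(sequences, max_identity):
    """Greedily keep sequences that are at most `max_identity` % identical
    to every representative kept so far (input order decides who is kept)."""
    representatives = []
    for name, seq in sequences.items():
        if all(percent_identity(seq, sequences[r]) <= max_identity
               for r in representatives):
            representatives.append(name)
    return representatives


# Toy data: s2 is 90% identical to s1; s4 is 60% identical to s3.
sequences = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKV",
    "s3": "ACDEWWWWWW",
    "s4": "WWWWWWWWWW",
}
loose = non_redundant(sequences, max_identity=95)
strict = non_redundant(sequences, max_identity=50)
print(loose, strict)
```

With the 95% setting all four toy sequences survive, whereas the 50% setting keeps only s1 and s3: a smaller set, but one in which no two members resemble each other closely, which is the sense in which the lower setting gives higher diversity.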

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Searchable interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit.

Further options: full-text keyword search, SCOP text search, and restriction by interface location (intra, inter or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
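
As a rough illustration of how some of the genome statistics listed in the feature table above could be derived from domain assignments, here is a minimal Python sketch. The `Protein` record and the `genome_statistics` function are hypothetical names introduced for this example only (they are not part of SUPERFAMILY), and the coverage figure assumes that assigned domain regions do not overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    """Hypothetical record: one predicted protein and its assigned domain regions."""
    length: int
    # (start, end) residue ranges, 1-based inclusive, assumed non-overlapping
    domains: list = field(default_factory=list)


def genome_statistics(proteins):
    """Compute a few per-genome summary numbers from a list of Protein records."""
    n_seqs = len(proteins)
    n_assigned = sum(1 for p in proteins if p.domains)
    total_length = sum(p.length for p in proteins)
    # Residues covered by at least one assigned domain (no overlap assumed).
    matched = sum(end - start + 1 for p in proteins for start, end in p.domains)
    n_domains = sum(len(p.domains) for p in proteins)
    return {
        "sequences": n_seqs,
        "sequences_with_assignment": n_assigned,
        "pct_sequences_with_assignment": 100.0 * n_assigned / n_seqs if n_seqs else 0.0,
        "pct_sequence_coverage": 100.0 * matched / total_length if total_length else 0.0,
        "domains_assigned": n_domains,
        "average_sequence_length": total_length / n_seqs if n_seqs else 0.0,
    }
```

For example, three proteins of lengths 100, 200 and 100, of which two carry assignments covering 90 residues in total, would give 2 of 3 sequences with an assignment and 22.5 percent total sequence coverage.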


Recent news

3rd June 2013 genetrainer being launched at LeWeb conference

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


HMM Hidden Markov model

HMMs are statistical models of the sequence consensus of an homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.

HMM consensus

The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
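
The consensus rule described above can be sketched in a few lines of Python. This is only an illustration, assuming the per-position emission probabilities are already available as dictionaries; the function name `hmm_consensus` and the input layout are hypothetical and are not taken from the HMMer package.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    emissions: list with one dict per match position, mapping a residue
    letter to its emission probability (hypothetical input layout).
    """
    consensus = []
    for probs in emissions:
        # Highest-probability residue at this position.
        residue, p = max(probs.items(), key=lambda item: item[1])
        # Upper case marks highly conserved positions (probability > threshold).
        consensus.append(residue.upper() if p > threshold else residue.lower())
    return "".join(consensus)
```

With a first position dominated by alanine (0.7) and a second position where leucine is merely the most likely residue (0.4), the sketch would yield "Al".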

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bit scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm

Localisation

Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
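
To make the definition concrete, the following Python sketch collapses identical sequences while remembering every identifier that pointed at them. It is an illustration only: `make_non_redundant` is a hypothetical helper, and treating sequences as identical after upper-casing is an assumption of this example, not a statement about how NRDB is actually built.

```python
def make_non_redundant(records):
    """Collapse identical sequences, keeping every identifier.

    records: iterable of (sequence_id, sequence) pairs.
    Returns a dict mapping each distinct sequence to the list of IDs sharing it.
    Fragments and splice variants differ in sequence, so they stay separate.
    """
    merged = {}
    for seq_id, sequence in records:
        key = sequence.upper()  # assumption: case differences are not meaningful
        merged.setdefault(key, []).append(seq_id)
    return merged
```

Note that a fragment such as "MKT" and its full-length parent "MKTAYL" would both be retained, which matches the remark above that sequences from the same gene may still occur more than once.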

ORF

Open reading frame

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

P-value


This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing an homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
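
The idea of scoring a sequence against position-specific scores can be illustrated with a deliberately simplified Python sketch. It slides the profile along the sequence without allowing gaps, so the gap penalties mentioned above are ignored; the function name, the `missing` default score and the input layout are all assumptions made for this example and do not reproduce any particular profile search program.

```python
def best_ungapped_hit(profile, sequence, missing=-1.0):
    """Slide a profile along a sequence and return (best_score, start_index).

    profile: list with one dict per column, mapping residue -> score.
    missing: score used for residues absent from a column (an assumption).
    Gap penalties are ignored; real profile searches align with gaps.
    Returns None when the sequence is shorter than the profile.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, missing) for col, res in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

Used on the sequence "GGACDA" with a three-column profile favouring A, C and D in turn, the best window would start at index 2 (the "ACD" segment).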

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database domain sequence database

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.

Searched domains

In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set by prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
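
The selection rule for secondary literature can be sketched as a simple counting step. The Python below is an illustration under stated assumptions: the neighbour lists are supplied by the caller (no Medline query is performed), and `secondary_literature` and its parameters are hypothetical names chosen for this example.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, max_neighbours=100, min_sources=3):
    """Select neighbouring papers shared by several hand-selected papers.

    neighbours_by_paper: dict mapping each hand-selected paper ID to an ordered
    list of neighbouring paper IDs (assumed to be retrieved elsewhere).
    min_sources=3 encodes "referenced from more than two original papers".
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # A neighbour counts at most once per original paper.
        for paper_id in set(neighbours[:max_neighbours]):
            counts[paper_id] += 1
    return sorted(pid for pid, n in counts.items() if n >= min_sources)
```

A neighbour shared by three hand-selected papers would be kept, while one shared by only two would be dropped.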

Seed Alignment


Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange;

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2)

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)

o each species' proteins are separated

o CD-HIT clustering with 96% identity cutoff is performed on each species separately

o longest member of each protein cluster is used as the representative

o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
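
The representative-selection step of the procedure above can be sketched as follows. This Python example assumes the clustering itself has already been done by an external tool and only shows how the longest member of each cluster might be chosen; `pick_representatives` is a hypothetical helper, and breaking length ties by identifier is an assumption added to make the result deterministic.

```python
def pick_representatives(clusters, lengths):
    """Choose the longest member of each cluster as its representative.

    clusters: list of lists of protein IDs for one species (clustering is
    assumed to have been done beforehand); a single-member list stands for
    a protein that belongs to no cluster.
    lengths: dict mapping protein ID -> sequence length.
    """
    representatives = []
    for members in clusters:
        # sorted() first, so that max() returns the smallest ID among ties.
        representatives.append(max(sorted(members), key=lambda pid: lengths[pid]))
    return representatives
```

Only the identifiers returned by such a step would then take part in architecture queries and domain counts, as described in the last list item.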

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
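
The last-common-ancestor step can be illustrated with a short Python sketch. It assumes each organism is given as a root-to-leaf list of taxon names; the function name and the lineage representation are hypothetical and are not SMART's actual implementation or the NCBI taxonomy API.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf lists of taxon names, one per organism
    that contains at least one protein with the architecture of interest.
    """
    if not lineages:
        return None
    ancestor = None
    # zip() walks the lineages level by level and stops at the shortest one.
    for level in zip(*lineages):
        if all(taxon == level[0] for taxon in level):
            ancestor = level[0]
        else:
            break
    return ancestor
```

For an architecture seen in human, fruit fly and baker's yeast, the sketch would report Eukaryota as the point of origin, given lineages that diverge directly below that node.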

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
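
A catalytic activity check of this kind could look roughly like the Python sketch below. It is a simplification under stated assumptions: required residues are given by position within the predicted domain sequence (a real check would more plausibly use alignment coordinates), and `check_catalytic_residues` with its input layout is hypothetical.

```python
def check_catalytic_residues(domain_sequence, required):
    """Report whether all required catalytic residues are present.

    domain_sequence: sequence of the predicted domain.
    required: dict mapping a 1-based position within the domain to the set of
    residues accepted there (hypothetical layout, positions assumed known).
    Returns (status, missing), where missing maps position -> residue found
    (None if the position lies outside the domain).
    """
    missing = {}
    for position, allowed in sorted(required.items()):
        in_range = 1 <= position <= len(domain_sequence)
        found = domain_sequence[position - 1] if in_range else None
        if found not in allowed:
            missing[position] = found
    return ("Inactive" if missing else "Active"), missing
```

The `missing` dictionary mirrors the idea that the annotation page lists which required amino acids are absent.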

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The 'Evolution' section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience; Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
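
Adjusting feature positions for gaps amounts to translating residue numbers into alignment columns. The Python sketch below shows one way this could be done; the function names and the choice of "-" and "." as gap characters are assumptions made for this example, not a description of SMART's code.

```python
def residue_columns(aligned_sequence, gap_chars="-."):
    """Return the 1-based alignment column of every residue, in order."""
    return [
        column
        for column, char in enumerate(aligned_sequence, start=1)
        if char not in gap_chars
    ]


def map_feature(aligned_sequence, start, end):
    """Map a feature given in 1-based residue numbers onto alignment columns."""
    columns = residue_columns(aligned_sequence)
    return columns[start - 1], columns[end - 1]
```

In the aligned row "MK--TAY-L", a feature spanning residues 2 to 4 would be drawn from column 2 to column 6, because two gap columns precede the third residue.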

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).


Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for Mozilla web browser. Click here for more info.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain 'bubbles' have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.

Selective SMART

We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.


Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ('tax break') is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., 'Sh3 AND sH3 AND sh3'.
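
The matching behaviour described for selective SMART queries (case-insensitive names, with repeated names requesting multiple copies) can be illustrated with a small Python sketch. It handles only terms joined by AND and is not SMART's query parser; `matches_query` and its arguments are hypothetical names used for this example.

```python
import re
from collections import Counter


def matches_query(query, protein_domains):
    """Check a protein's domain list against an AND-only architecture query.

    query: domain or feature names joined by AND, e.g. "Sh3 AND sH3 AND sh3".
    protein_domains: list of domain/feature names found in one protein.
    Names are compared case-insensitively; a name repeated n times in the
    query requires at least n copies in the protein.
    """
    terms = [t for t in re.split(r"\s+AND\s+", query.strip()) if t]
    required = Counter(term.lower() for term in terms)
    present = Counter(name.lower() for name in protein_domains)
    return all(present[name] >= count for name, count in required.items())
```

Under this reading, the query 'Sh3 AND sH3 AND sh3' would match only proteins carrying at least three SH3 domains.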

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.


GENERAL INTRODUCTION


The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version ...

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains ...

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes ...

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels: S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
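
The three inclusion criteria stated at the end of the paragraph above translate directly into a filter. The Python sketch below is only an illustration: the function name and argument names are hypothetical, and it assumes a resolution value is available for every structure, which the text does not discuss for NMR entries.

```python
def meets_cath_criteria(resolution_angstrom, n_residues, fraction_side_chains_resolved):
    """Apply the three stated inclusion criteria to one structure.

    resolution_angstrom: experimental resolution in Angstrom (lower is better).
    n_residues: length of the structure in residues.
    fraction_side_chains_resolved: value between 0 and 1.
    """
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and fraction_side_chains_resolved >= 0.70
    )
```

A 2.1 Å structure of 150 residues with 95% of side chains resolved would pass, whereas a 39-residue peptide would be excluded regardless of its resolution.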


Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
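
One plausible reading of the "within ±15 residues" comparison is sketched below in Python. The exact benchmark definition is not given in the text, so the per-segment comparison, the function name and the requirement that both assignments have the same number of segments are all assumptions of this example.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Compare predicted and manually validated boundaries for one domain.

    predicted, reference: lists of (start, end) residue segments.
    Every segment start and end must lie within `tolerance` residues of
    its counterpart (segments are paired after sorting by position).
    """
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in pairs
    )
```

A predicted domain 1-120 would agree with a curated domain 5-110, but not with a curated domain 5-100, since the end points differ by 20 residues.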

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
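The acceptance test for automatic chopping can be sketched as a simple threshold check. This is an illustration only: it covers just the four criteria listed above (the text says there are others, covered by "etc."), and the class and function names are hypothetical rather than taken from the CATH pipeline.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Summary of one boundary inheritance from a previously chopped chain."""
    ssap_score: float           # SSAP structural alignment score (0-100)
    seq_identity_pct: float     # sequence identity between query and chopped chain, in percent
    rmsd_angstrom: float        # RMSD of the superposition, in angstroms
    longest_end_extension: int  # longest unaligned extension at a chain end, in residues


def meets_autochop_criteria(chop):
    """Return True if the four listed thresholds are all satisfied.

    Only the criteria named in the text are checked; the real protocol
    applies further, unlisted criteria.
    """
    return (
        chop.ssap_score >= 80
        and chop.seq_identity_pct > 80
        and chop.rmsd_angstrom <= 6.0
        and chop.longest_end_extension <= 10
    )


def choose_action(chop):
    """Chop automatically when the criteria hold, otherwise pass to manual curation."""
    if meets_autochop_criteria(chop):
        return "AutoChop"
    return "manual assignment with ChopClose support"
```

A candidate that fails any single threshold is not discarded; as described above, it is kept as supporting evidence for the curator.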

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvement in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
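The hit-filtering rule in this paragraph can be written down as a short predicate. This is a sketch under stated assumptions: "residue overlap" is taken here to mean the fraction of the CATH domain's length covered by the aligned region, which the text does not define precisely, and the record fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """One UniProt sequence matched by a CATH domain HMM (hypothetical record layout)."""
    sequence_id: str
    superfamily: str
    e_value: float
    aligned_residues: int  # residues of the domain model covered by the match
    domain_length: int     # length of the CATH domain the HMM was built from


def is_sequence_relative(hit, max_e_value=0.01, min_overlap=0.60):
    """Apply the two thresholds from the text: E-value < 0.01 and 60% overlap.

    Overlap is assumed to be aligned_residues / domain_length.
    """
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def harvest(hits):
    """Group accepted sequence identifiers by superfamily."""
    relatives = {}
    for hit in hits:
        if is_sequence_relative(hit):
            relatives.setdefault(hit.superfamily, []).append(hit.sequence_id)
    return relatives
```

The annotation-transfer step uses the same pattern with stricter thresholds (95% sequence identity and 80% overlap), so the same function shape could be reused with different parameters.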

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available (for example, CATH domain PDB files, sequences and DSSP files), and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
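The SSAP score ranges quoted above can be turned into a rough lookup. This is a sketch only: the text describes what scores homologues and similar folds "generally" give, so the cut-offs below are rules of thumb rather than a definitive classifier, and the function name and labels are hypothetical.

```python
def interpret_ssap_score(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text.

    Scores above 80 are typical of homologous proteins and scores above 70
    of proteins sharing a fold; neither threshold proves the relationship.
    """
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "typical of a shared fold (Topology level)"
    return "no clear structural relationship"
```

A score in the 70 to 80 range without supporting sequence or functional similarity should be read as a shared fold only, since it may reflect convergent evolution rather than common ancestry.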

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes, in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α-β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only approximately 600 unique structures currently in the PDB, compared with approximately 20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
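The percentages in the two paragraphs above follow directly from the quoted counts, and the arithmetic can be checked with a few lines of Python. The counts are taken from the text; the number of novel folds is not stated explicitly and is derived here as the remainder, and the variable names are illustrative only.

```python
# Counts quoted in the text for domains classified June 1997 to March 1998.
total_new_domains = 2159
new_sequences = 443  # structures not clearly homologous to a known structure

# Share of the 2159 domains that were clearly homologous to known structures.
clearly_homologous_pct = round(100 * (total_new_domains - new_sequences) / total_new_domains)

# Breakdown of the 443 'new' sequences.
breakdown = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Novel folds are whatever is left over (derived, not quoted in the text).
breakdown["novel folds"] = new_sequences - sum(breakdown.values())

percentages = {
    label: round(100 * count / new_sequences) for label, count in breakdown.items()
}

# Structures with novel sequences that can still be assigned as homologues.
assignable_pct = percentages["clear homologues"] + percentages["probable homologues"]
```

The derived remainder is 35 structures, which rounds to the 8% of novel folds quoted in the text, and the clear plus probable homologues give 83% of the novel sequences.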


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive, non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Å of any atom of the interacting domain or chain. In the case of protein chains, each part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
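The distance criterion above can be illustrated with a minimal Python sketch. It is not the JAIL implementation: the data layout (a dict mapping residue ids to lists of atom coordinates) and the function names are assumptions made for this example, and the "at least 5 C-alpha atoms" rule is read here as "at least 5 residues on each side".

```python
import math

CUTOFF = 4.5       # distance threshold in Angstrom, taken from the definition above
MIN_RESIDUES = 5   # assumed reading of "at least 5 C-alpha atoms" per interface part


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return ids of residues in chain_a with at least one atom within
    `cutoff` of any atom of chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates
    (a hypothetical layout chosen for this sketch).
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    hits = set()
    for res_id, atoms in chain_a.items():
        # The whole residue counts as soon as a single atom is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            hits.add(res_id)
    return hits


def find_interface(chain_a, chain_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """Return (side_a, side_b) residue sets, or None if either side is too small."""
    side_a = interface_residues(chain_a, chain_b, cutoff)
    side_b = interface_residues(chain_b, chain_a, cutoff)
    if len(side_a) < min_residues or len(side_b) < min_residues:
        return None
    return side_a, side_b
```

The brute-force all-against-all distance check is quadratic in the number of atoms; for whole-PDB calculations a spatial index (grid or k-d tree) would normally replace the inner loop.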

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a large number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Search options

The JAIL search form accepts the following identifiers:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none. A full-text keyword search and a SCOP text search are also available, and results can be limited by interface location (intra, inter, or don't care).

MMDB

MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.

PDB (Protein Data Bank)

PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Natl. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Centre and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few have also been found in eukaryotes such as yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

ProfileScan

An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database (domain sequence database)

Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.

Searched domains

In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 this set was extended by prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.

Secondary Literature

The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
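The selection rule for secondary literature can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a dict from each hand-selected paper to its list of neighbouring paper ids) and the function name are hypothetical, and the retrieval of neighbours from Medline is not shown.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbouring papers linked from more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper id to an
    iterable of neighbouring paper ids (hypothetical input format).
    min_sources=3 encodes "more than two original papers".
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour at most once per original paper.
        counts.update(set(neighbours))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)
```

For example, with three original papers whose neighbour lists are `["x", "y", "z"]`, `["x", "y"]` and `["x", "z", "z"]`, only `"x"` is linked from more than two of them and is therefore returned.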

Seed Alignment

Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program by Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange;

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART (Simple Modular Architecture Research Tool)

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services, or have questions or feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)

o each species' proteins are separated

o CD-HIT clustering with a 96% identity cutoff is performed on each species separately

o the longest member of each protein cluster is used as the representative

o only representative cluster members and single proteins (i.e. proteins which are not members of any cluster) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
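The "last common ancestor" step can be sketched as follows. This is a minimal illustration, not SMART's code: it assumes each organism is already available as a root-to-leaf list of taxon names (a hypothetical input format; in practice the lineages would come from the NCBI taxonomy), and the function name is invented for the example.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf taxon name lists, one per organism that
    contains at least one protein with the architecture of interest.
    """
    if not lineages:
        return None
    lca = None
    # zip stops at the shortest lineage; walk down from the root while all agree.
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:
            break
        lca = taxa[0]
    return lca


# Illustrative lineages (abbreviated, for demonstration only).
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"]
```

With these toy lineages, an architecture found in human and fly would be dated to Metazoa, while one found in human, fly and yeast would be dated to Eukaryota.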

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as "Inactive". The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
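A check of this kind can be pictured with a small Python sketch. How SMART actually stores and evaluates the catalytic requirements is not described in these notes, so the data model here (1-based positions within the predicted domain mapped to sets of allowed amino acids) and the function name are assumptions made purely for illustration.

```python
def check_catalytic_activity(domain_seq, required):
    """Return (is_active, missing) for a predicted domain sequence.

    domain_seq: amino acid sequence of the predicted domain.
    required: dict mapping a 1-based position within the domain to the set
    of residues allowed there (hypothetical representation).
    missing maps each failing position to the residue actually found
    (None if the position lies outside the domain).
    """
    missing = {}
    for pos, allowed in required.items():
        residue = domain_seq[pos - 1] if 0 < pos <= len(domain_seq) else None
        if residue not in allowed:
            missing[pos] = residue
    return (not missing, missing)
```

A domain for which `missing` is non-empty would correspond to one marked "Inactive", with the dictionary listing which required positions failed.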

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The "Evolution" section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the "Additional information" page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the "Additional information" page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure-based profiles using RPS-Blast

Clicking on the "search schnipsel and structures" checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42): 405-7).


Improved Architecture analysis

In addition to standard "Domain selection" querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the "Taxonomic selection" box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with "Pfam:" (for example: TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

The SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.


Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ("tax break") is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. "Sh3 AND sH3 AND sh3".
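How a query such as "Sh3 AND sH3 AND sh3" could be matched against a protein can be sketched in Python. This is not SMART's query engine: the function name and the input format (a list of domain or feature names, one entry per copy found on the protein) are assumptions, and only the AND operator is handled.

```python
import re
from collections import Counter


def matches_query(query, features):
    """Return True if the protein carries every queried domain often enough.

    query: string of names joined by the AND operator, e.g. "Sh3 AND sH3".
    features: list of domain/feature names found on one protein, with one
    entry per copy (hypothetical input format).
    Matching is case insensitive; repeating a name requires that many copies.
    """
    terms = [t for t in re.split(r"\s+AND\s+", query.strip()) if t]
    required = Counter(term.lower() for term in terms)
    present = Counter(name.lower() for name in features)
    return all(present[name] >= n for name, n in required.items())
```

Under these assumptions, `matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH3", "SH2"])` holds because the protein has three SH3 copies, while a protein with only two SH3 domains would not match.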

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer

The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the home page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages

As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database new protocols and

classification levels give a more comprehensive resource for exploring evolution

Lesley H Greene Tony E Lewis Sarah Addou Alison Cuff Tim Dallman MarkDibley Oliver Redfern Frances Pearl Rekha Nambudiry Adam Reid Ian Sillitoe CorinYeats Janet M Thornton1 and Christine A Orengo

Author information Article notes Copyright and License information

This article has been cited by other articles in PMC

Go to

WHATrsquoS NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Second, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
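The nine-level scheme and the inclusion criteria described above can be illustrated with a short sketch. It assumes, purely for illustration, that a CATHsolid code is written as nine dot-separated integers in the order C.A.T.H.S.O.L.I.D; the function names and the example code string are hypothetical and are not taken from the CATH software.

```python
# Level names in the order given in the text: C, A, T, H, then S, O, L, I, D.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence identity thresholds (%) quoted in the text for the S, O, L and I levels.
SOLI_IDENTITY_THRESHOLDS = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code: str) -> dict:
    """Split a dotted nine-part code into a {level: number} mapping.

    The dotted layout is an assumption made for this sketch.
    """
    parts = code.strip().split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(
            f"expected {len(LEVELS)} dot-separated fields, got {len(parts)}"
        )
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def superfamily_code(code: str) -> str:
    """Return the C.A.T.H prefix, which names the homologous superfamily."""
    levels = parse_cathsolid(code)
    return ".".join(str(levels[name]) for name in LEVELS[:4])


def meets_inclusion_criteria(resolution_angstrom: float,
                             n_residues: int,
                             pct_side_chains_resolved: float) -> bool:
    """Apply the three inclusion thresholds quoted in the text."""
    return (
        resolution_angstrom <= 4.0          # 4 Å resolution or better
        and n_residues >= 40                # 40 residues or longer
        and pct_side_chains_resolved >= 70  # 70% or more side chains resolved
    )
```

For example, `superfamily_code("3.40.50.200.1.1.1.1.1")` returns the four-level superfamily prefix, while the trailing five numbers place the domain in its S, O, L, I and D groups.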

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
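The ±15 residue tolerance used in the benchmark can be expressed as a simple comparison. The sketch below is only one possible reading: it assumes each domain is a single continuous segment given as a (start, end) pair and that predicted and reference domains are listed in the same order. The text does not say how the benchmark handled discontinuous domains, so that case is not modelled, and the function name is hypothetical.

```python
def boundaries_within_tolerance(predicted, reference, tolerance=15):
    """Check that every predicted domain boundary lies within `tolerance`
    residues of the corresponding manually assigned boundary.

    `predicted` and `reference` are lists of (start, end) residue numbers,
    one pair per domain, in the same order (an assumption of this sketch).
    """
    if len(predicted) != len(reference):
        # A different number of domains cannot agree boundary by boundary.
        return False
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )
```

A prediction that places a boundary at residue 120 where the curated boundary is at residue 110 would count as agreeing; one that places it at residue 150 would not.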

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
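The decision between automatic chopping and manual support can be sketched as a threshold check over the inherited results. This is a minimal illustration, not the ChopClose implementation: only the four criteria named in the text are modelled (the text ends its list with "etc.", so further criteria exist), the RMSD cutoff is taken to be 6.0 Å, which is an assumption, and ranking candidates by SSAP score is a choice made here for illustration. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues


def qualifies_for_autochop(result: InheritedChopping) -> bool:
    """Apply the four criteria quoted in the text (the full list is longer)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity_pct > 80
        and result.rmsd_angstrom <= 6.0        # assumed value of the cutoff
        and result.longest_end_extension <= 10
    )


def choose_chopping(results):
    """Return ("AutoChop", result) if any inheritance passes every criterion,
    otherwise ("manual", best_result) as support for a curator.

    Ranking by SSAP score is an assumption of this sketch.
    """
    if not results:
        return ("manual", None)
    passing = [r for r in results if qualifies_for_autochop(r)]
    if passing:
        return ("AutoChop", max(passing, key=lambda r: r.ssap_score))
    return ("manual", max(results, key=lambda r: r.ssap_score))
```

Keeping the criteria in one predicate makes it straightforward to add the remaining checks once they are known.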

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages', in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (DomChop for domain boundary assignment and HomCheck for homologue recognition), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
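The hit-selection rule for harvesting sequence relatives can be written as a two-condition filter. The sketch below assumes that "residue overlap" means the fraction of the CATH domain's residues covered by the alignment and that the E-value cutoff is 0.01; both are assumptions made for illustration, and the function and parameter names are hypothetical.

```python
def is_homologous_hit(e_value: float,
                      aligned_residues: int,
                      domain_length: int,
                      max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Accept an HMM hit when it is both significant and covers enough
    of the CATH domain.

    Overlap is computed as aligned residues / domain length, which is
    an assumed definition for this sketch.
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap
```

The same shape of filter, with thresholds of 95% identity and 80% overlap, would describe the stricter rule used for transferring functional annotations.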

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
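The SSAP score bands mentioned above can be summarised as a small lookup. This is only a rough reading of the stated guideline values (100, above 80, above 70); the text says these hold "generally", so the labels are indicative rather than definitive, and the function name and label strings are hypothetical.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough category given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # May reflect convergent evolution if there is no sequence
        # or functional similarity to back it up.
        return "similar fold (Topology level)"
    return "no clear structural relationship"
```

A score in the 70 to 80 band would normally be followed up with sequence or functional evidence before any evolutionary relationship is claimed.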

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level). Recognising more distant structural similarity, with no accompanying sequence or function similarity, gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are how many more folds we need to determine before we have the complete set, and how confident we can be in assigning function between proteins having similar structures.

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues, as although the sequence identity was below 20% they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
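The percentages quoted for the 10-month sample can be checked with a few lines of arithmetic. The counts 2159, 443, 199, 169 and 40 come from the text; the number of novel folds is not stated and is obtained here by subtraction, which is an assumption of this sketch, as are the variable names.

```python
total_new_domains = 2159          # domains classified June 1997 to March 1998
new_sequence_structures = 443     # structures corresponding to new sequences

# Counts given in the text for the 443 new-sequence structures.
categories = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Assumed: whatever is left over corresponds to the novel folds.
novel_folds = new_sequence_structures - sum(categories.values())

# Domains clearly homologous to a protein of known structure.
clearly_homologous = total_new_domains - new_sequence_structures


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print(f"clearly homologous: {pct(clearly_homologous, total_new_domains)}%")
for name, count in categories.items():
    print(f"{name}: {pct(count, new_sequence_structures)}%")
print(f"novel folds: {novel_folds} ({pct(novel_folds, new_sequence_structures)}%)")
```

Under that assumption the four categories account for all 443 structures, and the rounded shares agree with the figures quoted in the text.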

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
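To make the "first three EC identifiers" criterion concrete, here is a minimal sketch of how one might test whether all members of a family agree at a given depth of the EC hierarchy. The function names and the example EC numbers are illustrative assumptions and are not taken from the CATH analysis itself.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of a dotted EC number as a tuple."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family has the same first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative family: two serine proteases that differ only in the fourth field.
print(family_shares_ec_prefix(["3.4.21.4", "3.4.21.5"]))            # same at 3 levels
print(family_shares_ec_prefix(["3.4.21.4", "3.4.21.5"], levels=4))  # differ at 4 levels
```

Setting `levels=4` corresponds to the stricter "single EC code" criterion mentioned in the text.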

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
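The distance criterion above can be sketched in a few lines. This is a minimal brute-force illustration for protein chains, not JAIL's actual implementation: the `Atom` record, the function names and the toy coordinates are assumptions made for the example, and a real calculation would read coordinates from a PDB file and use a spatial index instead of comparing all atom pairs.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    residue_id: int   # residue number within its chain
    name: str         # atom name, e.g. "CA" for C-alpha
    xyz: tuple        # (x, y, z) coordinates in Angstrom


CUTOFF = 4.5  # distance threshold from the text, in Angstrom
MIN_CA = 5    # minimum number of C-alpha atoms per interface side


def interface_residues(side, partner, cutoff=CUTOFF):
    """Residue ids on `side` with at least one atom within `cutoff` of any partner atom."""
    return {
        atom.residue_id
        for atom in side
        if any(math.dist(atom.xyz, other.xyz) <= cutoff for other in partner)
    }


def count_ca(atoms, residue_ids):
    """Number of C-alpha atoms belonging to the given residues."""
    return sum(1 for a in atoms if a.name == "CA" and a.residue_id in residue_ids)


def is_valid_side(atoms, residue_ids, min_ca=MIN_CA):
    """Apply the 'at least 5 C-alpha atoms' rule to one side of an interface."""
    return count_ca(atoms, residue_ids) >= min_ca


# Toy example: residue 1 is close to the partner, residue 2 is not.
chain_a = [
    Atom(1, "CA", (0.0, 0.0, 0.0)),
    Atom(1, "CB", (1.0, 0.0, 0.0)),
    Atom(2, "CA", (10.0, 0.0, 0.0)),
]
chain_b = [Atom(1, "CA", (4.0, 0.0, 0.0))]

contact = interface_residues(chain_a, chain_b)
print(contact, is_valid_side(chain_a, contact))
```

Note that the whole residue is included as soon as a single atom meets the cutoff, which is why the result is a set of residue ids and not a set of atoms.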

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

The redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

[Figure: database scheme]

Search for:

PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit.

A full-text keyword search and a SCOP text keyword search are also provided.

Consider only interfaces having the following location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either UniProt or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange

2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SMART: Simple Modular Architecture Research Tool

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

Species

Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.

WiseTools

A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for UniProt and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses UniProt as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
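Two steps of the redundancy procedure above are simple enough to sketch directly: collapsing identical sequences while keeping every ID, and choosing the longest member of each cluster as its representative. The sketch below assumes the clusters have already been produced by an external tool such as CD-HIT; the function names and toy sequences are hypothetical and are not part of SMART.

```python
from collections import defaultdict


def collapse_identical(records):
    """Keep one copy of each identical sequence but remember all of its IDs.

    `records` is an iterable of (protein_id, sequence) pairs.
    Returns a dict mapping each distinct sequence to the list of its IDs.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in records:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters):
    """Return the longest sequence of each cluster.

    `clusters` is a list of lists of sequences, assumed to come from a
    clustering step (e.g. CD-HIT at a 96% identity cutoff) run per species.
    Singletons are simply clusters of size one.
    """
    return [max(cluster, key=len) for cluster in clusters]


records = [("A", "MKV"), ("B", "MKV"), ("C", "MKVL")]
print(collapse_identical(records))
print(pick_representatives([["MKV", "MKVL"], ["MA"]]))
```

Keeping the ID list alongside each unique sequence is what allows the different identifiers to remain searchable after the duplicates are dropped.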

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
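The last-common-ancestor step described above can be illustrated with a small sketch. It assumes each protein carries its ordered list of domains and a root-to-leaf lineage of taxon names; these data structures, the function names and the example lineages are assumptions for illustration, not SMART's internal representation.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by all root-to-leaf lineages, or None if none is shared."""
    ancestor = None
    # zip stops at the shortest lineage, so only comparable ranks are examined.
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break
        ancestor = taxa_at_depth[0]
    return ancestor


def invention_point(proteins, architecture):
    """Last common ancestor of all organisms having a protein with this architecture.

    `proteins` is a list of dicts with keys "domains" (linear domain order)
    and "lineage" (taxon names from the root downwards).
    """
    lineages = [
        p["lineage"] for p in proteins if tuple(p["domains"]) == tuple(architecture)
    ]
    return last_common_ancestor(lineages) if lineages else None


proteins = [
    {"domains": ["SH3", "SH2", "TyrKc"],
     "lineage": ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Chordata"]},
    {"domains": ["SH3", "SH2", "TyrKc"],
     "lineage": ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Arthropoda"]},
    {"domains": ["SH3"],
     "lineage": ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi", "Ascomycota"]},
]

print(invention_point(proteins, ["SH3", "SH2", "TyrKc"]))
print(invention_point(proteins, ["SH3"]))
```

Because the architecture is the linear order of domains, the comparison uses ordered tuples, so the same domains in a different order count as a different architecture.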

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as "Inactive". The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
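The idea behind the catalytic activity check can be shown with a small sketch. It assumes the required residues are given as positions within the predicted domain sequence, which is a simplification; the function name, the "Active" label, and the example requirements are hypothetical and do not describe SMART's actual data or code.

```python
def check_catalytic_residues(domain_sequence, required_residues):
    """Label a predicted domain and report which required residues are missing.

    `required_residues` maps a 0-based position within the domain to the set
    of amino acids accepted at that position (an illustrative format).
    Returns (label, missing), where `missing` maps each failing position to
    the residue actually found there (None if the domain is too short).
    """
    missing = {}
    for position, allowed in required_residues.items():
        residue = domain_sequence[position] if position < len(domain_sequence) else None
        if residue not in allowed:
            missing[position] = residue
    label = "Inactive" if missing else "Active"
    return label, missing


# Hypothetical requirement: Asp at position 2 and Lys at position 5.
required = {2: {"D"}, 5: {"K"}}
print(check_catalytic_residues("AADGGK", required))
print(check_catalytic_residues("AANGGK", required))
```

Returning the missing positions together with the label mirrors the behaviour described above, where the annotation page lists which amino acids are absent.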

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.

The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the "Additional information" page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the "Additional information" page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast

Clicking on the "search schnipsel and structures" checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with "Pfam:" (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ("tax break") is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
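The semantics of such a query (AND-only, case insensitive, with a repeated term asking for that many copies) can be sketched as follows. This is an illustrative reading of the behaviour described above, not SMART's query engine; the function name and the feature lists are assumptions.

```python
from collections import Counter


def matches(query, features):
    """Evaluate an AND-only query against a protein's domains and features.

    Terms are compared case-insensitively, and a term repeated n times
    requires at least n copies of that domain or feature in the protein.
    """
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(feature.lower() for feature in features)
    return all(present[name] >= copies for name, copies in wanted.items())


print(matches("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH3", "SH2"]))  # three copies present
print(matches("Sh3 AND sH3 AND sh3", ["SH3", "SH3"]))                # only two copies
print(matches("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))     # intrinsic feature plus domain
```

Counting terms with a multiset is what turns "Sh3 AND sH3 AND sh3" into a request for three SH3 domains instead of a redundant single-domain query.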

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output

SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

alert SMART

The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries

You can ask for proteins having the same domain order/composition as your query protein.

SMARTed Genomes

You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches

The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer: The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages: As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes

In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
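
To make the nine-level scheme concrete, the following Python sketch splits a CATHSOLID code into its levels and reports the deepest level two domains share. It is a minimal sketch under stated assumptions: the dot-separated nine-number format, the example codes and the helper names `parse_cathsolid` and `deepest_shared_level` are chosen for illustration and are not taken from the CATH software.

```python
# Level letters in hierarchy order: Class, Architecture, Topology,
# Homologous superfamily, then the five SOLID sequence levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence identity cut-offs (percent) the text gives for S, O, L and I.
# D is only a counter within the I-level, so it has no threshold.
IDENTITY_THRESHOLD = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code):
    """Split a nine-part code (assumed dot-separated) into a level dict."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def deepest_shared_level(code_a, code_b):
    """Return the letter of the deepest level at which both domains agree,
    or None if they already differ at the Class level."""
    a, b = parse_cathsolid(code_a), parse_cathsolid(code_b)
    shared = None
    for level in LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared


# Hypothetical codes: same superfamily and S35 cluster, different S60 cluster.
level = deepest_shared_level("3.40.50.200.1.1.1.1.1", "3.40.50.200.1.2.1.1.1")
print(level, IDENTITY_THRESHOLD.get(level))
```

Two domains whose deepest shared level is S fall in the same 35% sequence identity cluster but in different 60% clusters.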

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
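
The ±15 residue criterion used in this benchmark can be illustrated with a short Python sketch. The text does not say how discontinuous domains or unmatched domains were scored, so the sketch assumes the simplest case: each domain is one contiguous (start, end) segment, and predicted and curated domains are already paired in order. The function names are hypothetical.

```python
def boundaries_agree(predicted, curated, tolerance=15):
    """True if both the start and the end of a predicted domain lie within
    `tolerance` residues of the manually curated boundaries."""
    return (abs(predicted[0] - curated[0]) <= tolerance
            and abs(predicted[1] - curated[1]) <= tolerance)


def fraction_within_tolerance(predicted_domains, curated_domains, tolerance=15):
    """Fraction of paired domains whose boundaries agree within tolerance."""
    if len(predicted_domains) != len(curated_domains):
        raise ValueError("domains must be paired one-to-one")
    if not predicted_domains:
        return 0.0
    hits = sum(boundaries_agree(p, c, tolerance)
               for p, c in zip(predicted_domains, curated_domains))
    return hits / len(predicted_domains)


# Two-domain chain: the first domain agrees, the second ends 20 residues off.
predicted = [(1, 120), (121, 300)]
curated = [(1, 110), (111, 280)]
print(fraction_within_tolerance(predicted, curated))
```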

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11) and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
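
The AutoChop decision described above amounts to a conjunction of thresholds, which the following Python sketch spells out. Only the four criteria named in the text are encoded; the text ends the list with 'etc.', so the real protocol applies further checks that are not reproduced here. The class and function names are hypothetical, and the RMSD cut-off is read as 6.0 Å.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain
    (hypothetical container, not part of the CATH code base)."""
    ssap_score: float           # SSAP structural similarity score, 0-100
    sequence_identity: float    # percent identity to the chopped chain
    rmsd: float                 # root mean square deviation in angstroms
    longest_end_extension: int  # residues extending beyond the alignment


def can_autochop(result):
    """Apply the four thresholds listed in the text. Passing them is
    necessary here, but the full protocol has additional criteria."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)


close_relative = InheritedChopping(ssap_score=92.0, sequence_identity=95.0,
                                   rmsd=1.8, longest_end_extension=3)
print("AutoChop" if can_autochop(close_relative) else "manual DomChop support")
```

A result that fails any one threshold is not discarded; as the text notes, it is passed on as supporting information for manual boundary assignment.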

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years, we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.001 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
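
The acceptance rule for HMM hits can be written as a small filter. This Python sketch reads the thresholds as an E-value below 0.001 and at least 60% residue overlap; the text does not define how overlap is computed, so the sketch assumes it is the fraction of the CATH domain length covered by the alignment. The function name and the hit tuples are invented for illustration.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.001, min_overlap=0.60):
    """Accept a hit if it is statistically significant and covers enough
    of the CATH domain. Overlap definition is an assumption (see lead-in)."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


# (sequence id, E-value, aligned residues, CATH domain length); made-up data.
hits = [
    ("seq_a", 1e-12, 110, 120),  # significant, 92% overlap
    ("seq_b", 1e-5, 60, 120),    # significant, only 50% overlap
    ("seq_c", 0.05, 118, 120),   # good overlap, not significant
]
accepted = [name for name, e, aligned, length in hits
            if is_homologous_hit(e, aligned, length)]
print(accepted)
```

The stricter annotation rule in the same paragraph (95% sequence identity with 80% overlap) has the same shape and could reuse the overlap check with different thresholds.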

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
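
As a rough reading aid, the SSAP score ranges quoted above can be turned into a small lookup. The cut-offs of 80 and 70 are the rules of thumb stated in the text (it says scores 'generally' fall in these ranges), not hard classification rules, and the function name and category labels are chosen for this sketch only.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the qualitative reading given in the
    text. Thresholds are heuristics, so treat the labels as hints."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold, homology uncertain"
    return "no clear structural relationship"


for s in (100, 86.5, 74.0, 52.3):
    print(s, "->", interpret_ssap(s))
```

A score in the 70 to 80 band on its own does not establish homology; as the text cautions, without sequence or functional similarity it may reflect convergent evolution.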

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
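
The two-branch grouping rule can be sketched in Python as follows. The sequence identity cut-offs (35% and 20%) come from the sentence above; that sentence does not quantify 'high structural similarity', so the sketch borrows an SSAP score of 80 as an assumed threshold, exposed as a parameter. The function name is hypothetical.

```python
def same_homologous_family(sequence_identity, ssap_score, ssap_threshold=80.0):
    """Decide whether two domains would be grouped into one homologous
    family under the rule described in the text.

    sequence_identity: percent identity between the two domains.
    ssap_score: structural similarity score (0-100).
    ssap_threshold: assumed cut-off for 'high structural similarity'.
    """
    # Branch 1: sequence similarity alone is sufficient.
    if sequence_identity >= 35:
        return True
    # Branch 2: weaker sequence similarity backed by structural similarity.
    return ssap_score >= ssap_threshold and sequence_identity >= 20


print(same_homologous_family(42.0, 65.0))  # sequence evidence alone
print(same_homologous_family(24.0, 88.0))  # structure plus some sequence
print(same_homologous_family(12.0, 91.0))  # structure only: not grouped
```

The third case shows why fold-level groupings exist separately: strong structural similarity without sequence (or functional) support is recorded at the topology level rather than as homology.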

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no


detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14), which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
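The breakdown of the 443 structures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the number of novel folds is not stated explicitly, so it is inferred here as the remainder, which is an assumption of this example.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences (Fig. 4b).
total = 443
counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# Assumption: novel folds are whatever is left after the three stated groups.
counts["novel fold"] = total - sum(counts.values())


def percent(part, whole):
    """Share of `whole` taken by `part`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


shares = {label: percent(n, total) for label, n in counts.items()}

# Clear and probable homologues together give the share assignable by structure.
homologue_share = percent(
    counts["clear homologue"] + counts["probable homologue"], total
)
```

Under that assumption the remainder is 35 structures, about 8% of 443, and the two homologue groups together come to about 83%, in line with the percentages quoted in the text.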


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
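The comparison of EC identifiers described above can be illustrated with a short sketch. It assumes EC numbers are written as four dot-separated fields; the function names and the example family are hypothetical and are not taken from CATH.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.4'."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Hypothetical family whose members differ only in the fourth EC field.
example_family = ["3.4.21.4", "3.4.21.1", "3.4.21.5"]
same_first_three = shares_ec_prefix(example_family, levels=3)
same_full_code = shares_ec_prefix(example_family, levels=4)
```

Setting `levels=4` corresponds to the stricter "single EC code" test, while `levels=3` corresponds to agreement on the first three identifiers.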

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches, which are used for molecular recognition in binding other proteins or ligands, may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this


will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analyzing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
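The distance criterion above can be sketched in a few lines. This is a minimal illustration, not JAIL's own code: the cutoff is taken here as 4.5 Å, the data layout (a dict of residue id to atom coordinates) and all function names are assumptions made for the example, and a brute-force search is used for clarity rather than speed.

```python
from math import dist

CUTOFF = 4.5       # distance cutoff in Angstrom (assumed reading of the text above)
MIN_RESIDUES = 5   # one C-alpha per residue, so 5 C-alpha atoms means 5 residues


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residue ids of chain_a with at least one atom within `cutoff` of chain_b.

    Each chain is a dict mapping a residue id to a list of (x, y, z) tuples,
    one tuple per atom of that residue.
    """
    atoms_b = [atom for atoms in chain_b.values() for atom in atoms]
    return {
        res_id
        for res_id, atoms in chain_a.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in atoms_b)
    }


def is_interface_part(residues, min_residues=MIN_RESIDUES):
    """Apply the minimum-size rule to one side of a protein-chain interface."""
    return len(residues) >= min_residues


# Toy coordinates: residue 1 lies 3 A from chain B, residue 2 lies 7 A away.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {1: [(3.0, 0.0, 0.0)]}
contacts = interface_residues(chain_a, chain_b)
```

For real structures the coordinates would come from a PDB parser, and a spatial index would replace the nested loop.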

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid


position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for: PDB-Id (e.g. 1aay), SCOP-Id (e.g. d1az0a_), EC number (e.g. 3.1.1), accession number (e.g. P03697) or protein name (e.g. capsid protein).

The search can be run over the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit.

A full-text keyword search and a SCOP text keyword search are also available, and results can be restricted to interfaces with a given location (intra, inter or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
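The re-iteration described above can be pictured as a simple loop. The sketch below is only an illustration of the idea and makes several assumptions: the search routine is passed in as a function that returns a single best hit as (start, end, score), already-found hits are masked with 'X' before searching again, and the toy search function merely looks for a literal motif. None of this is taken from the SMART code.

```python
def find_all_repeats(sequence, search_best_hit, threshold):
    """Repeat a one-hit-per-run search until no hit scores at or above threshold.

    `search_best_hit` is a hypothetical callable that takes a sequence and
    returns (start, end, score) for its best hit, or None if nothing matches.
    """
    hits = []
    masked = list(sequence)
    while True:
        hit = search_best_hit("".join(masked))
        if hit is None or hit[2] < threshold:
            break
        start, end, _score = hit
        hits.append(hit)
        # Mask the region so the next iteration has to report a new repeat.
        masked[start:end] = "X" * (end - start)
    return sorted(hits)


def toy_search(seq, motif="ABC"):
    """Stand-in for a profile search: first literal motif match, fixed score."""
    i = seq.find(motif)
    return None if i < 0 else (i, i + len(motif), 10.0)


repeats = find_all_repeats("ABCxxABCyABC", toy_search, threshold=5.0)
```

The loop ends either when the search reports nothing or when the best remaining hit falls below the threshold, which mirrors the stopping rule in the text.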

What's new

Changes from version 6.0 to 7.0

Full text search engine

Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.

metaSMART

Explore and compare domain architectures in various publicly available metagenomics datasets.

iTOL export and visualization

Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.

User interface cleanup

Various small changes to the UI, resulting in faster and easier navigation.

Changes from version 5.1 to 6.0

Metabolic pathways information

SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.

Updated genomic mode protein database

The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.

Changes from version 5.0 to 5.1

SMART webservice

You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).

SMART DAS server

The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.

If you need help with these services or have questions/feedback, please contact us.

Changes from version 4.1 to 5.0

New protein database in Normal mode

SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:

o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages

Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
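Two of the steps in the procedure above, collapsing identical sequences while keeping their IDs and choosing the longest member of each cluster, can be sketched as follows. The clustering itself is done by CD-HIT in the text and is not reimplemented here: the clusters are assumed to be given, and all names and example sequences are hypothetical.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Keep one entry per identical sequence while remembering every ID.

    `proteins` is an iterable of (protein_id, sequence) pairs for one species.
    Returns a dict mapping each distinct sequence to the list of its IDs.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters):
    """Return the longest member of each cluster as its representative.

    `clusters` is a list of lists of (protein_id, sequence) pairs, assumed to
    come from a per-species clustering run; a protein that belongs to no
    cluster is passed as a one-member list and so represents itself.
    """
    return [max(cluster, key=lambda member: len(member[1])) for cluster in clusters]


species_proteins = [("a", "MKT"), ("b", "MKT"), ("c", "MKTA"), ("d", "MK")]
unique = collapse_identical(species_proteins)
representatives = pick_representatives(
    [[("a", "MKT"), ("c", "MKTA")], [("d", "MK")]]
)
```

Running the two functions per species, rather than on the whole database at once, reflects the "each species separately" step in the list.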

Domain architecture invention dating

As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
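The last-common-ancestor step described above can be sketched as a prefix comparison over taxonomic lineages. This is a minimal illustration under stated assumptions: each organism is represented by a root-to-leaf list of taxon names, the lineages shown are shortened for the example and are not a complete NCBI lineage, and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by all lineages, or None if they share nothing.

    Each lineage is a list of taxon names ordered from the root downwards.
    """
    ancestor = None
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break
        ancestor = taxa_at_depth[0]
    return ancestor


# A domain architecture is the linear order of domains; a tuple keeps that order.
architecture = ("SH3", "SH2", "TyrKc")

# Shortened, illustrative lineages of organisms carrying that architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
]
origin = last_common_ancestor(lineages)
```

With these two lineages the shared prefix ends at Metazoa, which would be reported as the point of origin for the architecture.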

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic

For more details, visit the change mode page.

Intrinsic protein disorder prediction

You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check

SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
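The catalytic activity check described above amounts to testing whether required residues are present at given positions of a predicted domain. The sketch below assumes a simple representation, a mapping from 0-based position within the domain to the set of acceptable residues; the positions, residues and function names are invented for illustration and are not SMART's data or code.

```python
def missing_catalytic_residues(domain_sequence, required):
    """Positions whose required residue is absent from the predicted domain.

    `required` maps a 0-based position within the domain to the set of
    one-letter residues accepted at that position (hypothetical format).
    """
    missing = []
    for position, allowed in sorted(required.items()):
        if position >= len(domain_sequence) or domain_sequence[position] not in allowed:
            missing.append(position)
    return missing


def is_inactive(domain_sequence, required):
    """A domain is flagged inactive if any required residue is missing."""
    return bool(missing_catalytic_residues(domain_sequence, required))


# Invented requirement: histidine at position 2 and aspartate at position 5.
requirements = {2: {"H"}, 5: {"D"}}
intact = missing_catalytic_residues("AAHAADAA", requirements)
broken = missing_catalytic_residues("AAHAAAAA", requirements)
```

Returning the list of missing positions, rather than only a flag, matches the text's note that the annotation page shows which amino acids are missing.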

Taxonomic trees

Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned

SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics

For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.


The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART

You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access

Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences

If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure-based profiles using RPS-Blast

Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).


Improved Architecture analysis

In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator

Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments

All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains

These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART

We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes

SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual

The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page

There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature

Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.

Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page

The start page now includes selective SMART and allows searching for keywords in the annotation of domains.


Schnipsel Blast

The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown

When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links

Links don't contain version numbers. This allows stable links from external sources.

Selective SMART

Allows searching for multiple copies of domains and is case-insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
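The behaviour of such AND queries, where repeating a domain name asks for multiple copies and case is ignored, can be sketched as below. This is an illustration of the matching logic only, assuming the query is a flat list of names joined by AND and that each protein is given as a list of its domain and feature names; it is not the SMART implementation.

```python
import re
from collections import Counter


def parse_query(query):
    """Count how many copies of each (lower-cased) name the query asks for."""
    terms = re.split(r"\s+AND\s+", query.strip())
    return Counter(term.lower() for term in terms)


def matches(query, domains):
    """True if the protein has at least the requested number of each domain."""
    needed = parse_query(query)
    present = Counter(name.lower() for name in domains)
    return all(present[name] >= copies for name, copies in needed.items())


# Hypothetical proteins described by their domain and feature names.
three_sh3 = ["SH3", "SH3", "SH3", "TyrKc"]
two_sh3 = ["SH3", "SH3"]
receptor = ["SIGNAL", "TyrKc", "TRANS"]

triple_hit = matches("Sh3 AND sH3 AND sh3", three_sh3)
double_miss = matches("Sh3 AND sH3 AND sh3", two_sh3)
feature_hit = matches("SIGNAL AND TyrKc", receptor)
```

Using a counter on both sides makes the "multiple copies" rule fall out naturally: three mentions of sh3 require at least three SH3 domains in the protein.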

Annotation

You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database

SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output: SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART

Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART: The SMART database is updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries: You can ask for proteins having the same domain order or composition as your query protein.

SMARTed Genomes: You can get the results of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches: The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage

The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer: The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the home page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages: As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database

In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version …

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains …

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes …

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5).

The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
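The nine-level code described above can be handled as a simple dotted identifier. The sketch below assumes that a CATHSOLID code is written as nine dot-separated integers in the order C.A.T.H.S.O.L.I.D; the function names and the example code are illustrative and are not taken from the CATH software.

```python
# Level names in hierarchy order, as listed in the text.
CATHSOLID_LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> dict:
    """Split a dotted nine-level code into a {level: number} mapping.

    Assumes the 'C.A.T.H.S.O.L.I.D' dotted layout (an assumption of this sketch).
    """
    parts = code.strip().split(".")
    if len(parts) != len(CATHSOLID_LEVELS):
        raise ValueError(
            f"expected {len(CATHSOLID_LEVELS)} levels, got {len(parts)}"
        )
    return {level: int(part) for level, part in zip(CATHSOLID_LEVELS, parts)}


def superfamily_of(code: str) -> str:
    """Return the C.A.T.H prefix, i.e. the homologous superfamily."""
    levels = parse_cathsolid(code)
    return ".".join(str(levels[name]) for name in "CATH")


# Made-up domain code placed in superfamily 3.40.50.200.
example = "3.40.50.200.1.2.3.4.5"
print(parse_cathsolid(example))
print(superfamily_of(example))
```

Because the D-level is only a counter within the I-level, two domains that share the first eight levels are 100% identical in sequence under the clustering described in the text.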

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
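The ±15-residue criterion can be made concrete with a short sketch. The exact scoring used in the benchmark is not given in these notes, so the following is only one plausible reading: predicted and manually validated boundaries are represented as lists of residue positions (an assumption), and a prediction agrees when it has the same number of boundaries and each lies within the tolerance of its counterpart.

```python
def boundaries_agree(predicted: list[int], reference: list[int],
                     tolerance: int = 15) -> bool:
    """Compare predicted domain boundaries with manually validated ones.

    Hypothetical scoring: boundaries are residue positions; both lists
    must have the same length and, after sorting, every predicted
    position must lie within `tolerance` residues of its counterpart.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p - r) <= tolerance
        for p, r in zip(sorted(predicted), sorted(reference))
    )


# Two boundaries, off by 10 and 12 residues: accepted at a tolerance of 15.
print(boundaries_agree([120, 250], [110, 262]))
```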

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
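The acceptance test for automatic chopping can be sketched as a simple predicate over the four thresholds quoted above. The text ends the list with "etc.", so the real protocol applies further criteria that are not reproduced here; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InheritedChopping:
    """Summary of one boundary inheritance from an already chopped chain."""
    ssap_score: float            # SSAP structural similarity score (0-100)
    sequence_identity: float     # percent identity to the chopped chain
    rmsd: float                  # RMSD of the superposition, in angstroms
    longest_end_extension: int   # residues added beyond the inherited ends


def passes_autochop_criteria(c: InheritedChopping) -> bool:
    """Apply only the four thresholds listed in the text (the list is partial)."""
    return (
        c.ssap_score >= 80
        and c.sequence_identity > 80      # strictly greater than 80%
        and c.rmsd <= 6.0
        and c.longest_end_extension <= 10
    )


candidate = InheritedChopping(ssap_score=91.0, sequence_identity=95.0,
                              rmsd=1.8, longest_end_extension=3)
print(passes_autochop_criteria(candidate))
```

A chain that fails any one threshold would fall back to manual assignment, with the ChopClose result offered as supporting information.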

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation and also in the web-based resources used to aid manual validation have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
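The hit-selection rule for sequence relatives can be sketched as a filter on two quantities, the E-value and the fraction of the CATH domain covered by the match. The record layout below is hypothetical, and the reading of "60% residue overlap" as "at least 60% of the domain's residues aligned" is an assumption; the E-value cut-off is passed as a parameter so that it can be set to whatever the protocol specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HmmHit:
    """One match of a UniProt sequence against a CATH domain HMM."""
    sequence_id: str
    e_value: float
    aligned_residues: int   # domain residues covered by the match
    domain_length: int      # total residues in the CATH domain


def is_homologous_hit(hit: HmmHit, max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Keep hits below the E-value cut-off that cover enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


hits = [
    HmmHit("seqA", 1e-12, 110, 120),  # significant and well covered
    HmmHit("seqB", 1e-12, 40, 120),   # significant but only a fragment
    HmmHit("seqC", 0.5, 115, 120),    # well covered but not significant
]
print([h.sequence_id for h in hits if is_homologous_hit(h)])
```

Requiring both conditions guards against short, locally significant matches being counted as whole-domain relatives.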

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups) …

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
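The SSAP score ranges quoted above can be summarised in a small lookup. This is only a reading aid under the stated rules of thumb: the text says scores "generally" fall in these ranges, so the labels are indicative rather than decisive, and the function name and label strings are invented for this sketch.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) onto the rough ranges given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score > 80:
        # Typical of homologous proteins; identical proteins score 100.
        return "homologue range"
    if score > 70:
        # Typical of shared topology (fold level); may reflect convergence
        # when no sequence or functional similarity supports homology.
        return "fold range"
    return "below fold range"


for s in (100, 85, 75, 60):
    print(s, interpret_ssap(s))
```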

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
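The two-branch grouping rule can be written down as a short predicate. The sequence identity thresholds (35% and 20%) come from the text; "high structural similarity" is not quantified there, so the sketch takes it as an SSAP score of at least 80, which is an assumption exposed as a parameter. The function name is hypothetical.

```python
def same_homologous_family(sequence_identity: float, ssap_score: float,
                           high_structure_cutoff: float = 80.0) -> bool:
    """Decide whether two domains meet the homologous-family criteria.

    sequence_identity is a percentage. The SSAP cut-off standing in for
    'high structural similarity' is an assumption of this sketch.
    """
    # Branch 1: significant sequence similarity on its own is sufficient.
    if sequence_identity >= 35:
        return True
    # Branch 2: high structural similarity plus some sequence similarity.
    return ssap_score >= high_structure_cutoff and sequence_identity >= 20


print(same_homologous_family(sequence_identity=25, ssap_score=86))
```

Pairs with high structural similarity but under 20% identity fall through both branches; in CATH terms they would at most share a fold group rather than a homologous family.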

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
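The percentages quoted for the 443 'new' sequences can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the number of novel folds is not stated explicitly, so it is inferred here as the remainder, which is an assumption of this example.

```python
# Counts reported in the text for the 443 domains with "new" sequences.
total = 443
clear_homologues = 199
probable_homologues = 169
analogues = 40

# Assumption: the novel folds are whatever is left after the three named groups.
novel = total - (clear_homologues + probable_homologues + analogues)


def pct(count, whole=total):
    """Percentage of `whole`, rounded to the nearest integer."""
    return round(100 * count / whole)


breakdown = {
    "clear homologues": pct(clear_homologues),
    "probable homologues": pct(probable_homologues),
    "analogues": pct(analogues),
    "novel folds": pct(novel),
}

# Share of new sequences recognised as homologues (clear + probable).
homologues_by_structure = pct(clear_homologues + probable_homologues)
```

Under that reading the remainder is 35 domains, and the combined share of clear and probable homologues agrees with the 83% figure used later when discussing assignment of function through structure.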


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
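The test "first three EC identifiers are the same" can be made concrete as a prefix comparison on EC numbers. This is a minimal sketch, not the procedure used by the CATH authors; the function names are hypothetical and the EC strings in the usage note are illustrative inputs only.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number as a tuple of strings."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) <= 1


def has_single_ec_code(ec_numbers):
    """True if the family carries exactly one full, four-field EC code."""
    return len(set(ec_numbers)) == 1
```

For example, `shares_ec_prefix(["3.4.21.4", "3.4.21.5"])` holds because both strings begin 3.4.21, while `has_single_ec_code` on the same list does not, since the fourth field differs.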

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.

For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

(i) The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

(ii) The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

(iii) For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

(iv) Some methods which aim to predict function ab initio from structure have already been developed, and will increasingly be the focus of attention over the next few years. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing through the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
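The distance criterion above can be sketched in a few lines. This is an illustration under stated assumptions, not JAIL's implementation: the cutoff is read as 4.5 Å, each chain is represented as a hypothetical dict from residue id to a list of atom coordinates, and the "at least 5 C-alpha atoms" rule is approximated as "at least 5 residues" (one C-alpha per residue).

```python
import math

CUTOFF = 4.5       # Angstrom; distance threshold taken from the definition above
MIN_RESIDUES = 5   # assumption: one C-alpha per residue, so 5 C-alpha atoms = 5 residues


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residues of chain_a with at least one atom within `cutoff` of any atom of chain_b.

    Each chain is a dict mapping a residue id to a list of (x, y, z) atom coordinates.
    """
    atoms_b = [xyz for atoms in chain_b.values() for xyz in atoms]
    hits = []
    for res_id, atoms in chain_a.items():
        # The whole residue counts as soon as a single atom pair is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            hits.append(res_id)
    return hits


def is_valid_part(residues, minimum=MIN_RESIDUES):
    """Check the minimum-size requirement for one side of a protein interface."""
    return len(residues) >= minimum
```

The all-against-all loop is quadratic in the number of atoms; for whole-database runs a spatial index (for example a grid or k-d tree) would be the usual choice, but that is beyond this sketch.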

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification, which classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit

A full-text keyword search and a SCOP text keyword search are also available. Results can be restricted to interfaces with a given location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore the taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


o The longest member of each protein cluster is used as the representative.
o Only representative cluster members and single proteins (i.e. proteins which are not members of any cluster) are used in all domain architecture queries and for domain counts in the annotation pages.

Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
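The two rules above (longest member represents its cluster; only representatives and unclustered proteins are queried) can be sketched as follows. The data layout and function names are assumptions made for illustration and do not reflect SMART's actual code or schema.

```python
def pick_representatives(clusters, lengths):
    """Return the longest member of each cluster.

    clusters: dict mapping a cluster id to a list of protein ids
    lengths:  dict mapping a protein id to its sequence length
    Ties are broken by list order (the first longest member wins).
    """
    return {
        cluster_id: max(members, key=lambda protein: lengths[protein])
        for cluster_id, members in clusters.items()
    }


def query_set(clusters, lengths):
    """Proteins used for queries: cluster representatives plus unclustered proteins."""
    clustered = {p for members in clusters.values() for p in members}
    singles = set(lengths) - clustered
    return set(pick_representatives(clusters, lengths).values()) | singles
```

With `clusters = {"c1": ["A", "B"]}` and `lengths = {"A": 120, "B": 300, "C": 80}`, the query set is `{"B", "C"}`: B represents its cluster and C is a single protein.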

Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
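The "point of origin" described above is a last-common-ancestor computation over taxonomic lineages. A minimal sketch follows, assuming each organism is given as a root-to-species list of taxon names; the function name is hypothetical and the lineages in the usage note are deliberately simplified, not full NCBI taxonomy paths.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by all lineages.

    Each lineage is a list of taxon names ordered from the root to the species.
    Returns None if there are no lineages or they share no common root.
    """
    if not lineages:
        return None
    ancestor = None
    # Walk down all lineages in parallel until they diverge.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor
```

For instance, with simplified lineages for a human, a fly and a yeast protein that share the same architecture, the result would be the node at which all three paths still agree (here Eukaryota), which is then reported as the point of invention.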

Changes from version 4.0 to 4.1

Two modes of operation: Normal and Genomic
For more details, visit the change mode page.

Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.

Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as "Inactive". The domain annotation page will show you the details on which amino acids are missing, with links to the relevant literature.
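The catalytic activity check amounts to testing whether required residues are present at given positions of a predicted domain. The sketch below is an illustration only: it assumes positions are 1-based offsets into the domain sequence, whereas SMART may well define them relative to its domain alignment, and the function name and data layout are hypothetical.

```python
def check_catalytic_residues(domain_seq, required):
    """Check a predicted domain against its catalytic residue requirements.

    domain_seq: amino acid sequence of the predicted domain (one-letter codes)
    required:   dict mapping a 1-based position to the set of residues allowed there
    Returns (is_active, missing), where `missing` maps each failing position to
    the residue actually found (None if the position lies outside the domain).
    """
    missing = {}
    for position, allowed in required.items():
        inside = 0 < position <= len(domain_seq)
        residue = domain_seq[position - 1] if inside else None
        if residue not in allowed:
            missing[position] = residue
    return (not missing, missing)
```

A domain for which `missing` is non-empty would be the one labelled "Inactive", and the contents of `missing` correspond to the per-residue details shown on the annotation page.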

Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.

User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.

Changes from version 3.5 to 4.0

Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.


The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.

Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
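A 1:1 reciprocal best match means that two proteins are each other's top-scoring hit across a pair of genomes. The following sketch assumes the best hit in each direction has already been computed (for example from sequence searches) and is supplied as two hypothetical dicts; it does not reproduce SMART's own pipeline.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where a's best hit in genome B is b and b's best hit in genome A is a.

    best_a_to_b: dict mapping each protein of genome A to its best hit in genome B
    best_b_to_a: dict mapping each protein of genome B to its best hit in genome A
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )
```

An orthologous group in the stricter sense described above would then require such a reciprocal pair between every pair of genomes analysed, so that each member has exactly one partner in every other genome.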

Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
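Adjusting a feature position "according to gaps" means converting a residue number in the ungapped protein into a column of the aligned, gapped sequence. A minimal sketch, assuming the gap character is "-" and positions are 1-based; the function name is hypothetical.

```python
def column_of(aligned_seq, position, gap="-"):
    """Map a 1-based residue position in the ungapped sequence to its 1-based alignment column."""
    seen = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies beyond the end of the sequence")


def map_feature(aligned_seq, start, end, gap="-"):
    """Map a feature given as (start, end) residue positions onto alignment columns."""
    return column_of(aligned_seq, start, gap), column_of(aligned_seq, end, gap)
```

For the aligned row `MK--TA`, residue 3 (T) sits in column 5, so a domain covering residues 2 to 3 would be drawn across columns 2 to 5, spanning the gap.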

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.

Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.

Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure-based profiles using RPS-Blast
Clicking on the "search schnipsel and structures" checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).


Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with "Pfam:" (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is freely available, and it's great. You may experience some problems if you're using a clunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.

Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.


Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links
Links don't contain version numbers. This allows stable links from external sources.

Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
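The behaviour described for selective SMART (terms joined by AND, repeated terms meaning multiple copies, case-insensitive matching) can be mimicked with a small matcher. This is a sketch of the query semantics as described in these notes, not SMART's implementation; the function name and the representation of a protein as a list of feature names are assumptions.

```python
from collections import Counter


def matches(query, features):
    """Evaluate an AND-only domain query against one protein.

    query:    terms joined by " AND ", e.g. "SIGNAL AND TyrKc"
    features: list of domain/feature names found in the protein, with repeats
    A term repeated n times requires at least n copies; matching ignores case.
    """
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in features)
    return all(present[term] >= count for term, count in wanted.items())
```

So `matches("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH2"])` is false because only two SH3 copies are present, while the same query against a protein with three SH3 domains succeeds.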

Annotation
You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database
SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART
Selective SMART allows searching for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART
The SMART database is updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries
You can ask for proteins having the same domain order and composition as your query protein.

SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches
The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo


WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.


GENERAL INTRODUCTION


The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among the structures solved by structural genomics. Although the influx of more diverse structures and their subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 were fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. Two situations preclude the CATH update process from being fully automated: we rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system includes five ‘SOLID’ sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the ‘General Information’ section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, a length of 40 residues or more, and 70% or more of their side chains resolved.
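The nine-level code described above can be illustrated with a short Python sketch. It assumes, purely for illustration, that a CATHsolid identifier is written as nine dot-separated integers in the order C.A.T.H.S.O.L.I.D; the function names and the example identifier are hypothetical and are not taken from any CATH software.

```python
# Level names in the order given in the text: four structural levels (CATH)
# followed by the five sequence-based levels (SOLID).
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code):
    """Split a dot-separated CATHsolid code into a {level: number} mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def superfamily(code):
    """Return the C.A.T.H prefix, i.e. the homologous superfamily."""
    levels = parse_cathsolid(code)
    return ".".join(str(levels[name]) for name in "CATH")


# Hypothetical identifier, used only to exercise the functions.
example = "3.40.50.200.1.2.3.4.5"
print(parse_cathsolid(example))
print(superfamily(example))
```

Two domains that share the first four numbers belong to the same homologous superfamily; agreement on further levels indicates progressively higher sequence identity, and the final D number only distinguishes individual domains.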


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 ‘difficult’ multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80; sequence identity >80%; RMSD ≤6.0 Å; longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
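The AutoChop decision can be sketched as a simple threshold check. The Python below is only an illustration: it encodes the four criteria named in the text (SSAP score of at least 80, sequence identity above 80%, RMSD of at most 6.0 Å, longest end extension of at most 10 residues) and ignores the further, unspecified criteria covered by 'etc'. The 6.0 Å value assumes a decimal point was lost in the extracted text, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Quality measures for boundaries inherited from one chopped chain."""
    ssap_score: float            # SSAP structural alignment score (0-100)
    sequence_identity: float     # percent identity to the chopped chain
    rmsd: float                  # RMSD of the alignment, in angstroms
    longest_end_extension: int   # residues


def meets_autochop_criteria(result):
    """Check the four thresholds listed in the text (others are omitted)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def choose_route(results):
    """Return 'AutoChop' if any inheritance passes, otherwise 'manual'."""
    if any(meets_autochop_criteria(r) for r in results):
        return "AutoChop"
    return "manual"


# Made-up values for two candidate chains.
candidates = [
    InheritedChop(ssap_score=75.0, sequence_identity=92.0, rmsd=2.1, longest_end_extension=3),
    InheritedChop(ssap_score=91.5, sequence_identity=88.0, rmsd=1.4, longest_end_extension=6),
]
print(choose_route(candidates))
```

In the 'manual' case the best ChopClose result would still be shown to the curator as supporting information rather than discarded.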


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvement in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years, we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps together with ‘holding stages’ in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases [ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21), SWISS-PROT (22)] are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
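The two filters in this paragraph can be written down as small predicates. The sketch below is hypothetical Python, not the CATH-DHS code: it assumes that 'residue overlap' means the aligned residues as a fraction of the CATH domain (or annotated sequence) length, that the stated percentages are minimum values, and that the E-value cutoff reads 0.01 (the decimal point is missing in the extracted text, so the cutoff is left as a parameter).

```python
def is_sequence_relative(e_value, aligned_residues, domain_length,
                         max_e_value=0.01, min_overlap=0.60):
    """HMM hit filter: E-value below the cutoff and enough domain coverage."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


def can_transfer_annotation(percent_identity, aligned_residues, target_length,
                            min_identity=95.0, min_overlap=0.80):
    """BLAST hit filter used before copying functional annotation."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    overlap = aligned_residues / target_length
    return percent_identity >= min_identity and overlap >= min_overlap


# Made-up hits: the first covers 75% of a 120-residue domain, the second 50%.
print(is_sequence_relative(1e-5, 90, 120))
print(is_sequence_relative(1e-5, 60, 120))
print(can_transfer_annotation(97.0, 170, 200))
```

The annotation filter is deliberately much stricter than the homologue filter, which fits the caution expressed elsewhere in these notes about inheriting function between distant relatives.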

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the ‘folds’ of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an ‘all versus all’ HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
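As a rough guide, the SSAP score bands quoted in this passage can be turned into a lookup. The Python below is an illustrative sketch only: the cutoffs of 80 and 70 are the rules of thumb stated in the text ('generally' above 80 or 70), not hard limits, and the function name and the wording of the returned labels are hypothetical.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (distant homologue or analogue)"
    return "no clear structural relationship"


for value in (100, 86.3, 74.0, 52.5):
    print(value, "->", interpret_ssap(value))
```

A score in the 70 to 80 band says only that two domains share a fold; deciding between divergence from a common ancestor and convergence needs additional sequence or functional evidence.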

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a ‘structural genomics’ initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α−β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold ‘band’ therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes on average only 30 ndash 46 of sequences can be assigned to a structural family

by recognising sequence similarity to a protein of known structure (1516) With only sim600 unique

structures currently in the PDB compared with sim20 000 sequence families it is clear that we still

need to determine many more structures if we are to understand biology at the molecular level

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues, as although the sequence identity was below 20% they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
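The percentages quoted for Figure 4 can be checked against the raw counts. The short sketch below only re-derives them from the numbers given in the text; the variable names are illustrative, and the count of novel folds is inferred as the remainder rather than stated in the source.

```python
# Counts quoted in the text for domains classified June 1997 to March 1998.
total_new_domains = 2159
new_sequence_structures = 443  # structures without a clear (>=30% identity) relative

categories = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: novel folds are whatever remains once the three named groups are removed.
categories["novel folds"] = new_sequence_structures - sum(categories.values())


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


shares = {name: percent(n, new_sequence_structures) for name, n in categories.items()}
homologous_share = percent(total_new_domains - new_sequence_structures, total_new_domains)

for name, share in shares.items():
    print(f"{name}: {categories[name]} structures, {share}%")
print(f"clearly homologous to known structures: {homologous_share}%")
```

The clear and probable homologues together make up the 83% figure used later in the discussion of function assignment.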


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
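As an illustration of what "the first three EC identifiers were the same" means in practice, the following minimal sketch compares EC numbers field by field. The function names and the example families are hypothetical and are not taken from the CATH analysis itself.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.1.1.7'."""
    fields = ec_number.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return tuple(fields[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family agrees on the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Hypothetical families, used only to exercise the functions.
family_a = ["3.1.1.7", "3.1.1.8", "3.1.1.1"]   # same class, subclass and sub-subclass
family_b = ["3.1.1.7", "1.1.1.1"]              # different enzyme classes

print(shares_ec_prefix(family_a))            # agreement on three fields
print(shares_ec_prefix(family_a, levels=4))  # a single EC code is a stricter test
print(shares_ec_prefix(family_b))
```

Setting `levels=4` corresponds to the stricter "single EC code" criterion mentioned for the families with higher sequence identity.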

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.

For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id: 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
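To see what these ranges mean as a share of a whole genome, the percentages can be multiplied through. This is only a back-of-the-envelope sketch: pairing the lower bounds together and the upper bounds together is an assumption made here for illustration, not something the source states.

```python
def percent_of_percent(outer, inner):
    """Return `inner` percent of `outer` percent, expressed in percent."""
    return outer * inner / 100


# Ranges quoted in the summary (all in percent).
no_sequence_match = (54, 70)        # sequences with no obvious match in the PDB
homologous_by_structure = (80, 90)  # of those, homologous using structure alone
novel_fold = (10, 20)               # of those, expected to be novel folds

homologous_range = tuple(
    percent_of_percent(o, i) for o, i in zip(no_sequence_match, homologous_by_structure)
)
novel_range = tuple(
    percent_of_percent(o, i) for o, i in zip(no_sequence_match, novel_fold)
)

print(f"homologous by structure: {homologous_range[0]:.1f}-{homologous_range[1]:.1f}% of the genome")
print(f"novel folds: {novel_range[0]:.1f}-{novel_range[1]:.1f}% of the genome")
```

Under this pairing, structure comparison alone would account for roughly 43 to 63% of all sequences in a genome, with novel folds making up roughly 5 to 14%.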

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
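The distance criterion can be sketched in a few lines. This is a minimal illustration of the definition as worded above, not JAIL's own code: the data layout (tuples of residue id, atom name and coordinates), the function names and the toy coordinates are all assumptions, and a real implementation would read atoms from PDB files and use a spatial index rather than comparing every pair.

```python
import math

CUTOFF = 4.5       # Angstrom, distance threshold from the definition
MIN_CA_ATOMS = 5   # minimum C-alpha atoms per interface side (protein chains)


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return ids of residues in chain_a with any atom within `cutoff` of chain_b.

    Each chain is a list of (residue_id, atom_name, (x, y, z)) tuples.
    """
    found = set()
    for res_id, _, xyz_a in chain_a:
        if res_id in found:
            continue  # one close atom is enough to include the whole residue
        for _, _, xyz_b in chain_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                found.add(res_id)
                break
    return found


def is_valid_side(chain, residues, min_ca=MIN_CA_ATOMS):
    """True if the selected residues contribute at least `min_ca` C-alpha atoms."""
    ca_count = sum(1 for res_id, name, _ in chain if name == "CA" and res_id in residues)
    return ca_count >= min_ca


# Toy chains: two parallel strands 4.0 Angstrom apart, chain A two residues longer.
chain_a = [(i, "CA", (3.8 * i, 0.0, 0.0)) for i in range(1, 8)]
chain_b = [(i, "CA", (3.8 * i, 4.0, 0.0)) for i in range(1, 6)]

side_a = interface_residues(chain_a, chain_b)
print(sorted(side_a), is_valid_side(chain_a, side_a))
```

In the toy example, residues 6 and 7 of chain A fall outside the cutoff, so only residues 1 to 5 form the interface side.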

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Search options

Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)

Interface types that can be searched:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit

A full-text keyword search and a SCOP text search are also available. Results can be restricted to interfaces with a given location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features:

(i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures.
(ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles.
(iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure.

In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.

You can switch off intron display on your SMART preferences page.

Alternative splicing information

Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices, or to get a graphical multiple sequence alignment of all of them.

Orthology information

There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
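The idea of a 1:1 reciprocal best match can be sketched as follows. This is a generic illustration of the reciprocal-best-hit criterion, not SMART's or Ensembl's actual pipeline; the function name and the protein identifiers are hypothetical, and the best hits are assumed to have been computed beforehand by some sequence search.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    Each argument maps a protein id in one genome to its single best-scoring
    match in the other genome.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical best hits between two genomes.
human_to_mouse = {"H1": "M1", "H2": "M2", "H3": "M2"}
mouse_to_human = {"M1": "H1", "M2": "H2", "M3": "H3"}

pairs = reciprocal_best_hits(human_to_mouse, mouse_to_human)
print(pairs)
```

Here H3 is not paired, because its best hit M2 prefers H2. An orthologous group in the sense described above would additionally require such a reciprocal pair to exist between every pair of genomes analysed.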

Graphical multiple sequence alignments

Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).

Changes from version 3.4 to 3.5

Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.

Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
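For a large set of identifiers, the quoted limits imply splitting the work into batches and days. The sketch below only plans that split; it does not contact SMART, and the function name and identifiers are illustrative assumptions.

```python
MAX_PER_ACCESS = 500   # sequences per batch request, from the text
MAX_PER_DAY = 2000     # sequences per day, from the text


def plan_batches(sequence_ids, per_access=MAX_PER_ACCESS, per_day=MAX_PER_DAY):
    """Group ids into days; each day is a list of batches within the limits."""
    days = []
    for day_start in range(0, len(sequence_ids), per_day):
        todays = sequence_ids[day_start:day_start + per_day]
        days.append(
            [todays[i:i + per_access] for i in range(0, len(todays), per_access)]
        )
    return days


# Hypothetical identifiers, used only to exercise the function.
ids = [f"seq{i}" for i in range(2300)]
plan = plan_batches(ids)
for day_number, batches in enumerate(plan, start=1):
    print(f"day {day_number}: batch sizes {[len(b) for b in batches]}")
```

With 2300 identifiers this gives four full batches on the first day and one batch of 300 on the second.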

Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.

Changes from version 3.3 to 3.4

Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).

Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.

Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART
We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.

Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links
Links don't contain version numbers. This allows stable links from external sources.

Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.

Annotation
You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database
SMART now uses PostgreSQL 6.5.2.
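The selective SMART queries quoted in these change notes (such as SIGNAL AND TyrKc, or Sh3 AND sH3 AND sh3) can be mimicked with a small matcher. This is a sketch of the described behaviour only, assuming that a repeated term asks for that many copies of the domain; it supports nothing but AND, and the function names and example proteins are hypothetical rather than part of SMART.

```python
from collections import Counter


def parse_query(query):
    """Split an AND-only query into a Counter of lower-cased domain names."""
    terms = [term.strip().lower() for term in query.split(" AND ")]
    if any(not term for term in terms):
        raise ValueError(f"empty term in query {query!r}")
    return Counter(terms)


def matches(query, domains):
    """True if `domains` contains every queried name at least as often as asked."""
    required = parse_query(query)
    present = Counter(domain.lower() for domain in domains)
    return all(present[name] >= count for name, count in required.items())


# Hypothetical domain architectures, listed from N- to C-terminus.
receptor_kinase = ["SIGNAL", "TRANS", "TyrKc"]
adaptor = ["SH3", "SH3", "SH2"]

print(matches("SIGNAL AND TyrKc", receptor_kinase))
print(matches("Sh3 AND sH3", adaptor))
print(matches("Sh3 AND sH3 AND sh3", adaptor))
```

The last query asks for three SH3 copies, so the two-SH3 adaptor does not match, while the case of the domain names makes no difference.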

Changes from version 2.0 to 3.0

Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries
You can ask for proteins having the same domain order/composition as your query protein.

SMARTed genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches
The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.


The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The numbers of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).


Figure 3 Flow diagram of the CATH classification pipeline This schematic illustrates theprocesses involved in classifying newly determined structures in CATH The CATHupdate protocol workflow from new chain to assigned domain is split into two mainprocesses

In this paper we report our ongoing development of the automated proceduresThese critical new features should better enable CATH to keep pace with the PDB (3)

and facilitate its development Key statistics on the domain structure populationsand characteristics are also presented


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
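As a rough illustration of how the nine-level code described above might be handled in software, the Python sketch below splits a dotted identifier into named levels. The dotted layout `C.A.T.H.S.O.L.I.D`, the function names and the example identifier are assumptions made for illustration only; the authoritative format is the one described on the CATH website.

```python
from typing import Dict

# Level names in hierarchy order: the four CATH levels followed by the
# five SOLID sequence levels (S, O, L, I cluster at 35, 60, 95 and 100%
# sequence identity; D is a counter within the I-level).
LEVELS = ["C", "A", "T", "H", "S", "O", "L", "I", "D"]


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dotted nine-part code (assumed layout) into named levels."""
    parts = code.strip().split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return {name: int(value) for name, value in zip(LEVELS, parts)}


def superfamily(code: str) -> str:
    """Return the C.A.T.H prefix, i.e. the homologous superfamily."""
    return ".".join(code.strip().split(".")[:4])


# Hypothetical identifier, used only to exercise the functions.
example = "3.40.50.200.1.2.1.1.3"
levels = parse_cathsolid(example)
```

Two domains sharing the same `superfamily(...)` prefix but differing at the S-level would, under this reading, be homologues with less than 35% sequence identity.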


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
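The ±15 residue criterion used in this benchmark can be sketched as a simple comparison of predicted and manually validated boundaries. The Python below is only an illustration: it assumes each domain is a single continuous segment given as (start, end) residue numbers, which ignores discontinuous domains, and the function name and example coordinates are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) residue numbers, inclusive


def boundaries_agree(predicted: List[Segment],
                     reference: List[Segment],
                     tolerance: int = 15) -> bool:
    """True if every predicted domain start and end lies within
    `tolerance` residues of the matching reference boundary."""
    if len(predicted) != len(reference):
        return False  # a different number of domains cannot agree
    pairs = zip(sorted(predicted), sorted(reference))
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in pairs
    )


# Hypothetical two-domain chain: manual boundaries vs. two predictions.
manual = [(1, 120), (121, 260)]
close_prediction = [(1, 112), (113, 260)]
poor_prediction = [(1, 90), (91, 260)]
```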

Since domain boundary assignment of remote homologues is one of the most time consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
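The acceptance test for automatic chopping can be sketched as a simple predicate over the scores of one inherited chopping. This is a minimal Python sketch, not the actual ChopClose code: the class and field names are hypothetical, and only the four thresholds quoted in the text are modelled (the text's "etc." indicates that further criteria exist).

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float           # SSAP structural alignment score (0-100)
    sequence_identity: float    # percent identity to the chopped chain
    rmsd: float                 # in angstroms
    longest_end_extension: int  # unaligned residues at a chain end


def passes_autochop(r: InheritedChopping) -> bool:
    """Apply the four thresholds quoted in the text; the real protocol
    uses additional criteria that are not modelled here."""
    return (
        r.ssap_score >= 80
        and r.sequence_identity > 80
        and r.rmsd <= 6.0
        and r.longest_end_extension <= 10
    )


# Hypothetical results: the first would be chopped automatically, the
# second would be passed on as support data for manual assignment.
good = InheritedChopping(92.5, 97.0, 1.8, 3)
borderline = InheritedChopping(85.0, 80.0, 2.5, 4)  # identity not > 80
```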

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.


For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
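The hit-selection rule described above can be sketched as a small filter. This Python sketch is illustrative only: the `Hit` record, its field names and the example values are hypothetical, overlap is assumed to mean the fraction of the CATH domain covered by the match, and both thresholds are exposed as parameters because the extracted text is ambiguous about the exact E-value cutoff.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the match
    domain_length: int     # length of the CATH domain


def is_homologue(hit: Hit,
                 max_e_value: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Keep a hit only if it is significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def select_homologues(hits: List[Hit]) -> List[str]:
    """Return the identifiers of the hits that pass both thresholds."""
    return [h.sequence_id for h in hits if is_homologue(h)]


# Hypothetical scan results against one 150-residue domain model.
hits = [
    Hit("seq_A", 1e-20, 140, 150),  # significant, 93% overlap: kept
    Hit("seq_B", 1e-5, 60, 150),    # significant, 40% overlap: rejected
    Hit("seq_C", 0.5, 150, 150),    # full overlap, not significant: rejected
]
selected = select_homologues(hits)
```

The same predicate, called with `min_overlap=0.80` on BLAST hits that have already been filtered for 95% sequence identity, would correspond to the annotation-transfer rule in the final sentence.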

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
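The two routes into a homologous family stated above can be written as a single decision rule. The Python sketch below is an illustration under one stated assumption: "high structural similarity" is taken to mean an SSAP score of at least 80, the figure these notes give elsewhere for homologous proteins; the function name and the example values are hypothetical.

```python
def same_homologous_family(sequence_identity: float,
                           ssap_score: float) -> bool:
    """Decide whether two domains would be grouped as homologues.

    sequence_identity is a percentage (0-100); ssap_score is the SSAP
    structural similarity score (0-100).
    """
    # Route 1: significant sequence similarity on its own.
    if sequence_identity >= 35:
        return True
    # Route 2: high structural similarity (assumed SSAP >= 80) backed
    # by some sequence similarity.
    return ssap_score >= 80 and sequence_identity >= 20


# Hypothetical pairs of domains.
clear_by_sequence = same_homologous_family(42.0, 65.0)
clear_by_structure = same_homologous_family(24.0, 86.0)
not_assigned = same_homologous_family(12.0, 88.0)
```

A pair such as the last one, with similar structure but little sequence identity, is not excluded from being related; it would need further evidence such as functional similarity before being assigned.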

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red: mainly-α; green: mainly-β; yellow: α-β; blue: few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues, as although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry, but neither the sequence nor the function gave definite evidence of a common ancestor.
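The percentages quoted for the 443 'new' sequences can be checked with a few lines of arithmetic. In the Python sketch below, the three counts are taken from the text; the count of novel folds is not stated there and is inferred as the remainder, on the assumption that the four categories are exhaustive.

```python
total = 443  # structures with 'new' sequences

# Counts stated in the text.
groups = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Inferred, not stated in the text: whatever is left must be novel folds.
groups["novel folds"] = total - sum(groups.values())

# Percentages rounded to whole numbers, as they are quoted in the text.
percentages = {name: round(100 * n / total) for name, n in groups.items()}

# Homologues recognised with the help of structural data (clear + probable).
homologues = groups["clear homologues"] + groups["probable homologues"]
homologue_percentage = round(100 * homologues / total)
```

The remainder works out at 35 structures, which rounds to the 8% given for novel folds, and the clear plus probable homologues together make up 83% of the set.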


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
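The percentages in the summary above can be turned into rough counts for a genome of a given size. The sketch below is only an illustration of that arithmetic: the function name, the genome size of 1000 sequences, and the choice to multiply the range endpoints together are assumptions, not part of the original analysis.

```python
def genome_fold_estimates(n_sequences,
                          unmatched=(0.54, 0.70),
                          structural_homologue=(0.80, 0.90)):
    """Return rough (low, high) sequence counts for each category.

    unmatched: fraction of sequences with no obvious sequence match in the PDB.
    structural_homologue: fraction of those unmatched sequences expected to be
    homologous to a known family once a structure is available.
    The remaining 10-20% of unmatched sequences are treated as novel folds.
    """
    lo_u, hi_u = unmatched
    lo_h, hi_h = structural_homologue
    return {
        "no_sequence_match": (n_sequences * lo_u, n_sequences * hi_u),
        "homologous_by_structure": (n_sequences * lo_u * lo_h,
                                    n_sequences * hi_u * hi_h),
        "novel_fold": (n_sequences * lo_u * (1 - hi_h),
                       n_sequences * hi_u * (1 - lo_h)),
    }


if __name__ == "__main__":
    # Hypothetical genome of 1000 predicted protein sequences.
    for category, (low, high) in genome_fold_estimates(1000).items():
        print(f"{category}: about {low:.0f} to {high:.0f} sequences")
```

Multiplying endpoints gives the widest plausible band; the true spread depends on how the two percentages co-vary, which the text does not state.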

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank (PDB). Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.

To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. More general questions, however, require large-scale analysis. For this purpose, a detailed form allows comprehensive non-redundant data sets to be compiled for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
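The distance criterion above can be sketched in a few lines of Python. This is a minimal illustration, not the JAIL implementation: the function names and the input format (a list of `(residue_id, (x, y, z))` atom records per chain) are assumptions, and the minimum of 5 C-alpha atoms is approximated here by requiring at least 5 interface residues on each side.

```python
from math import dist

CUTOFF = 4.5        # distance cutoff in Ångström, as in the definition above
MIN_RESIDUES = 5    # stands in for "at least 5 C-alpha atoms" per side


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF):
    """Return ids of residues in atoms_a that have at least one atom
    within `cutoff` of any atom in atoms_b.

    atoms_a, atoms_b: lists of (residue_id, (x, y, z)) atom records.
    """
    return {
        res_a
        for res_a, xyz_a in atoms_a
        if any(dist(xyz_a, xyz_b) <= cutoff for _, xyz_b in atoms_b)
    }


def find_interface(atoms_a, atoms_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """Return (side_a, side_b) residue-id sets, or None if either side
    is too small to count as an interface."""
    side_a = interface_residues(atoms_a, atoms_b, cutoff)
    side_b = interface_residues(atoms_b, atoms_a, cutoff)
    if len(side_a) < min_residues or len(side_b) < min_residues:
        return None
    return side_a, side_b
```

The all-against-all distance check is quadratic in the number of atoms; for whole-database runs a spatial grid or k-d tree would normally replace it.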

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence-based clustering uses the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
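To show what sequence-based redundancy removal does, here is a small sketch of greedy incremental clustering in the style of CD-HIT: sequences are processed from longest to shortest, and each one either joins the first representative it matches at or above the identity threshold or becomes a new representative. The identity measure is a deliberately crude stand-in built on Python's `difflib` (CD-HIT itself uses short-word filtering and banded alignment), and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def identity(seq_a, seq_b):
    """Crude identity estimate: matched residues / length of the shorter
    sequence. An illustrative stand-in, not CD-HIT's own measure."""
    blocks = SequenceMatcher(None, seq_a, seq_b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / min(len(seq_a), len(seq_b))


def greedy_cluster(seqs, threshold):
    """Cluster sequences greedily, longest first.

    seqs: dict mapping sequence id -> sequence string.
    threshold: minimum identity (0..1) for joining an existing cluster.
    Returns a dict mapping representative id -> list of member ids.
    """
    clusters = {}
    for seq_id in sorted(seqs, key=lambda s: len(seqs[s]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[seq_id]) >= threshold:
                clusters[rep].append(seq_id)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters
```

Keeping only the representatives (the dictionary keys) yields the non-redundant set; a lower threshold gives fewer, more diverse representatives.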

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All chain interfaces that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a large number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit.

A full-text keyword search and a SCOP text keyword search are also offered, and results can be restricted to interfaces with a given location (intra, inter, or don't care).

MMDB

Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family, or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

Improved architecture analysis

In addition to standard domain selection querying, it is now possible to run queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After you select the domains of interest from the list, proteins containing those domains are displayed. Use the taxonomic selection box to limit the results to particular taxonomic ranges.

Pfam domains are stored in the database

The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in architecture queries, prefix the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).

Try our SMART Toolbar for the Mozilla web browser.

Changes from version 3.2 to 3.3

Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and to use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.

Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.

Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.

Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.

Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.

Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera :-), and a bunch of RAM).

Changes from version 3.1 to 3.2

Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.

Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.

Changes from version 3.0 to 3.1

Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.

Schnipsel Blast
The results of a Schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.

Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links
Links don't contain version numbers. This allows stable links from external sources.

Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.

Annotation
You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database
SMART now uses PostgreSQL 6.5.2.
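The selective SMART queries described in these release notes (terms joined by AND, case-insensitive names, a repeated term asking for multiple copies of a domain) can be mimicked with a short sketch. This is not SMART's code: the function names are hypothetical, and reading a repeated term as "at least that many copies" is an interpretation of the release note.

```python
import re
from collections import Counter


def parse_query(query):
    """Split a query such as 'Sh3 AND sH3 AND TRANS' into a Counter of
    lower-cased terms; a term repeated n times requires n copies."""
    terms = re.split(r"\s+AND\s+", query.strip())
    return Counter(term.lower() for term in terms)


def matches(features, query):
    """Return True if a protein's feature list satisfies the query.

    features: domain and intrinsic-feature names found in one protein,
    e.g. ['SIGNAL', 'TyrKc', 'TRANS'] (a hypothetical annotation).
    """
    have = Counter(feature.lower() for feature in features)
    return all(have[term] >= count for term, count in parse_query(query).items())
```

For example, `matches(["SH3", "SH3", "SH3", "TyrKc"], "Sh3 AND sH3 AND sh3")` holds because three SH3 copies are present, while a protein with only two copies would not match.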

Changes from version 2.0 to 3.0

Digest output
SMART now produces only a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART
The SMART database is updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.

Domain queries
You can ask for proteins having the same domain order and composition as your query protein.

SMARTed genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches
The PFAM searches now run on a PVM cluster.

Changes from version 1.03 to 2.0

Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the home page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and their subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for the years 1972–2005 were fitted to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version …

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected, or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains …

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated: we rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamily level clusters proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, a length of 40 residues or more, and 70% or more of side chains resolved.
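The nine-level code described above lends itself to a simple data-structure sketch. The example below assumes a dot-separated identifier with one integer per level (C, A, T, H, S, O, L, I, D); the function names and the example codes in the comments are hypothetical and not taken from a CATH release.

```python
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence identity cutoffs (percent) used to cluster within a superfamily.
IDENTITY_CUTOFF = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code):
    """Split a code such as '2.40.130.10.1.2.1.3.2' (a made-up example)
    into a dict mapping each level letter to its integer value."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(part) for part in parts)))


def share_level(code_a, code_b, level):
    """True if two domains fall in the same cluster at `level`,
    i.e. their codes agree on every field down to that level."""
    depth = LEVELS.index(level) + 1
    return code_a.split(".")[:depth] == code_b.split(".")[:depth]


def meets_inclusion_criteria(resolution, n_residues, frac_side_chains):
    """Inclusion filter from the text: resolution of 4 Å or better,
    40 residues or more, 70% or more of side chains resolved."""
    return resolution <= 4.0 and n_residues >= 40 and frac_side_chains >= 0.70
```

Two domains for which `share_level(a, b, "S")` is true belong to the same 35% identity cluster of one superfamily; agreement down to "I" means they are sequence-identical, and only the D counter tells them apart.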

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
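The ±15 residue agreement criterion used in the benchmark can be written as a small check. This sketch makes simplifying assumptions that the text does not state: each domain is a single continuous segment given as `(start, end)` residue numbers, predicted and reference domains are listed in the same order, and the function name is hypothetical.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if every predicted domain boundary lies within
    `tolerance` residues of the corresponding reference boundary.

    predicted, reference: lists of (start, end) tuples, one per domain,
    in the same order. A different number of domains counts as disagreement.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )
```

Real domains can be discontinuous (built from several chain segments), in which case each segment boundary would need to be compared in the same way.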

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10), and sequence-based methods such as hidden Markov models (HMMs) (11), together with relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
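The decision between automatic and manual chopping can be sketched as a threshold check over the alignment results. Only the four criteria listed in the text are encoded; the "etc." there implies further criteria that are not reproduced here. The class, field and function names are hypothetical, and picking the passing result with the highest SSAP score is an assumption made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class InheritanceResult:
    """Scores for boundaries inherited from one previously chopped chain."""
    source_chain: str
    ssap_score: float
    seq_identity: float           # percent
    rmsd: float                   # Ångström
    longest_end_extension: int    # residues


def can_autochop(result):
    """Apply the four thresholds quoted in the text."""
    return (
        result.ssap_score >= 80
        and result.seq_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def route_chain(results):
    """Return ('AutoChop', best) if any result passes all thresholds,
    otherwise ('manual', best) so the best result can support curation.
    'Best' is taken to be the highest SSAP score (an assumption)."""
    if not results:
        return "manual", None
    passing = [r for r in results if can_autochop(r)]
    pool = passing if passing else results
    best = max(pool, key=lambda r: r.ssap_score)
    return ("AutoChop" if passing else "manual"), best
```

In both branches the best available result is returned, mirroring the text's point that a failed automatic attempt still serves as supporting information for the curator.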

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (DomChop for domain boundary assignment and HomCheck for homologue recognition) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is the designation of domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pairwise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pairwise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information on, and links to, other functional databases, namely ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
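The acceptance criteria for sequence relatives can be sketched in a few lines of Python. This is only an illustration, not the CATH pipeline: the `Hit` record and its field names are hypothetical, the E-value threshold is read as 0.01, and "residue overlap" is assumed to mean the fraction of the CATH domain covered by the match.

```python
from dataclasses import dataclass

# Thresholds as read from the text; the constant names are hypothetical.
MAX_EVALUE = 0.01    # hits must have an E-value below this
MIN_OVERLAP = 0.60   # hits must cover at least this fraction of the domain


@dataclass
class Hit:
    """One HMM match of a sequence against a CATH domain model (illustrative)."""
    sequence_id: str
    evalue: float
    aligned_residues: int  # residues of the domain covered by the match
    domain_length: int     # length of the CATH domain


def overlap_fraction(hit: Hit) -> float:
    """Fraction of the domain covered by the match (assumed definition)."""
    return hit.aligned_residues / hit.domain_length


def is_homologue(hit: Hit) -> bool:
    """Apply both criteria: a significant E-value and sufficient overlap."""
    return hit.evalue < MAX_EVALUE and overlap_fraction(hit) >= MIN_OVERLAP


def filter_hits(hits):
    """Keep only the hits that satisfy both criteria."""
    return [h for h in hits if is_homologue(h)]
```

A hit with a very low E-value is still rejected if it covers only a small part of the domain, which is why both conditions are checked together.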

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and to identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core and, although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families in which significant changes in the structures have occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; these can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helices, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and of proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
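The four categories above can be written as a small decision rule, and the quoted percentages can be checked against the counts. The sketch below is a simplification under stated assumptions: the function and argument names are hypothetical, an SSAP score of 70 is assumed as the boundary for "same fold" (the text says more distant folds generally score above 70), and the count of novel folds is derived by subtraction rather than quoted.

```python
# Thresholds taken from the text, except SAME_FOLD_SSAP, which is an assumption.
CLEAR_SSAP = 80.0
CLEAR_IDENTITY = 20.0   # percent sequence identity
SAME_FOLD_SSAP = 70.0   # assumed cutoff for "resembles a known fold"


def classify_relationship(ssap, seq_identity,
                          functional_similarity=False,
                          significant_profile_hit=False):
    """Assign a new structure to one of the four categories described above."""
    if ssap < SAME_FOLD_SSAP:
        return "novel fold"
    if ssap >= CLEAR_SSAP and seq_identity >= CLEAR_IDENTITY:
        return "clear homologue"
    if functional_similarity or significant_profile_hit:
        return "probable homologue"
    return "analogue"


# Counts quoted in the text for the 443 structures with new sequences.
TOTAL = 443
COUNTS = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
# Derived by subtraction (not quoted in the text): 443 - 408 = 35.
COUNTS["novel fold"] = TOTAL - sum(COUNTS.values())

# Rounded percentages of the 443 structures in each category.
PERCENTAGES = {name: round(100 * n / TOTAL) for name, n in COUNTS.items()}

# Share assignable as homologues (clear plus probable) from structure.
HOMOLOGUE_SHARE = round(
    100 * (COUNTS["clear homologue"] + COUNTS["probable homologue"]) / TOTAL
)
```

The rounded values reproduce the 45%, 38%, 9% and 8% given in the paragraph, and the combined share of clear and probable homologues comes to 83%.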


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
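Comparing EC identifiers at a given depth, as in the analysis above, amounts to comparing prefixes of the four-part EC number. A minimal sketch follows; the function names are hypothetical and the EC numbers in the usage note are only illustrative inputs.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.4'."""
    return tuple(ec_number.split(".")[:levels])


def share_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in a family agrees on its first `levels` fields.

    An empty family returns False, since there is nothing to compare.
    """
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1
```

For example, `share_ec_prefix(["3.4.21.4", "3.4.21.1"])` is true at three levels but false with `levels=4`, which corresponds to a family whose members have closely related functions rather than a single EC code.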

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins in order to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compilation of comprehensive, non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
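The distance criterion for protein chains can be illustrated with a brute-force sketch. This is not the JAIL implementation: the `Atom` record, the function names and the all-against-all distance loop are assumptions for the example, and the cutoff is taken as 4.5 Ångström.

```python
import math
from dataclasses import dataclass

CUTOFF = 4.5       # Ångström; distance within which atoms count as in contact
MIN_CA_ATOMS = 5   # minimum number of C-alpha atoms on one side of an interface


@dataclass(frozen=True)
class Atom:
    """A single atom with its chain, residue number, atom name and coordinates."""
    chain: str
    residue: int
    name: str
    x: float
    y: float
    z: float


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def interface_residues(side, partner, cutoff=CUTOFF):
    """Residue numbers in `side` with at least one atom within `cutoff`
    of any atom in `partner` (the whole residue then counts)."""
    return {
        a.residue
        for a in side
        if any(distance(a, b) <= cutoff for b in partner)
    }


def is_valid_side(side, partner, min_ca=MIN_CA_ATOMS):
    """True if the interface part on `side` contains at least `min_ca` C-alpha atoms."""
    residues = interface_residues(side, partner)
    ca_residues = {a.residue for a in side if a.name == "CA" and a.residue in residues}
    return len(ca_residues) >= min_ca
```

The pairwise loop is quadratic in the number of atoms; for whole-database calculations a spatial index such as a grid or k-d tree would normally replace it.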

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

The JAIL search form accepts the following identifiers:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Searches can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit. A full-text keyword search and a SCOP text keyword search are also provided, and results can be limited to interfaces with a given location (intra, inter, or don't care).

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

Recent news

3rd June 2013: genetrainer being launched at LeWeb conference

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

Schnipsel Blast: The results of a Schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.

Taxonomic breakdown: When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.

Links: Links don't contain version numbers. This allows stable links from external sources.

Selective SMART: Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.

Annotation: You can align your query sequence to the SMART alignment using hmmalign.

Update of underlying database: SMART now uses PostgreSQL 6.5.2.

Changes from version 2.0 to 3.0

Digest output: SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.

Selective SMART: Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.

Alert SMART: The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.

Domain queries: You can ask for proteins having the same domain order and composition as your query protein.

SMARTed genomes: You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.

Faster PFAM searches: The PFAM searches now run on a PVM cluster.

Changes from version 1.0.3 to 2.0

Domain coverage: The original set of signalling domains has now been extended to include extracellular domains.

Default search method is now HMMer: The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the home page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.

Complete rewrite of the annotation pages: As with the previous version (1.0.3), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.

Literature database: In addition to the manually derived literature sources given in version 1.0.3, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among the structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
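The nine-level scheme can be made concrete with a small parser. The sketch below assumes that a CATHSOLID code is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D; the function names and the example code are hypothetical and are not taken from the CATH software.

```python
from typing import Dict

# Level names in hierarchy order: C, A, T, H followed by the five SOLID levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence identity cut-offs (percent) quoted in the text for the S, O, L and I levels.
# The D level is only a counter within the I level, so it has no cut-off.
SEQUENCE_IDENTITY_CUTOFFS = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dotted code (assumed format) into a mapping of level name to number."""
    parts = code.strip().split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}: {code!r}")
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def superfamily_code(code: str) -> str:
    """Return the first four levels (C.A.T.H), i.e. the homologous superfamily."""
    levels = parse_cathsolid(code)
    return ".".join(str(levels[name]) for name in "CATH")


if __name__ == "__main__":
    example = "3.40.50.300.1.1.1.1.1"  # hypothetical identifier, for illustration only
    print(parse_cathsolid(example))
    print(superfamily_code(example))
```

Two domains that share the first four numbers belong to the same homologous superfamily; each further shared SOLID level indicates membership of the same, tighter sequence cluster.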

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database, to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
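The ±15 residue criterion used in the benchmark can be illustrated as follows. This is a minimal sketch under simplifying assumptions that are not stated in the text: each domain is a single contiguous segment given as a (start, end) pair, and predicted and reference domains are paired in sequence order. The function name is hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (first residue, last residue) of one domain


def boundaries_agree(predicted: List[Segment],
                     reference: List[Segment],
                     tolerance: int = 15) -> bool:
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding manually validated boundary."""
    if len(predicted) != len(reference):
        # A different number of domains can never count as agreement.
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in pairs
    )


if __name__ == "__main__":
    manual = [(1, 110), (111, 250)]
    print(boundaries_agree([(1, 100), (101, 250)], manual))  # boundary off by 10 residues
    print(boundaries_agree([(1, 80), (81, 250)], manual))    # boundary off by 30 residues
```

Real CATH domains can be discontinuous (built from several segments), so a production check would need to compare sets of segments rather than single pairs.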

Since domain boundary assignment of remote homologues is one of the most time consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
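The decision between automatic chopping and manual curation can be sketched as a simple threshold check. The example below covers only the four criteria named in the text; the source says "etc.", so the real protocol applies further criteria that are not reproduced here. The class and function names are hypothetical, and the thresholds are as read from the text above.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Summary of boundaries inherited from one previously chopped chain."""
    ssap_score: float            # SSAP structural alignment score (0-100)
    sequence_identity: float     # percent identity between query and chopped chain
    rmsd: float                  # RMSD of the alignment, in angstroms
    longest_end_extension: int   # unmatched residues at a chain end


def passes_autochop(result: InheritedChop) -> bool:
    """Check the four criteria quoted in the text (assumed to be a partial list)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def route(result: InheritedChop) -> str:
    """Accept the boundaries automatically or pass them to a curator as support data."""
    return "AutoChop" if passes_autochop(result) else "manual DomChop with ChopClose support"


if __name__ == "__main__":
    print(route(InheritedChop(92.5, 98.0, 1.2, 3)))
    print(route(InheritedChop(85.0, 75.0, 2.0, 3)))
```

Keeping the failed result as support information, rather than discarding it, mirrors the behaviour described for ChopClose.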

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues in length.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
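The hit-selection rule described above amounts to a two-condition filter. The sketch below is illustrative only: it assumes that "residue overlap" means the fraction of the CATH domain covered by the aligned region, and the default thresholds are the values as read from the text. The class and function names are hypothetical and do not correspond to any CATH or HMMER API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HmmHit:
    """One match between a UniProt sequence and a CATH domain HMM."""
    sequence_id: str
    evalue: float
    aligned_residues: int   # residues of the domain covered by the alignment
    domain_length: int      # length of the CATH domain the HMM represents


def is_homologous(hit: HmmHit,
                  max_evalue: float = 0.01,
                  min_overlap: float = 0.60) -> bool:
    """Keep a hit only if it is significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


def harvest(hits: Iterable[HmmHit]) -> List[str]:
    """Return the identifiers of sequences accepted as domain relatives."""
    return [hit.sequence_id for hit in hits if is_homologous(hit)]


if __name__ == "__main__":
    hits = [
        HmmHit("seqA", 1e-20, 110, 120),  # significant and well covered
        HmmHit("seqB", 1e-20, 40, 120),   # significant but covers a third of the domain
        HmmHit("seqC", 0.5, 118, 120),    # well covered but not significant
    ]
    print(harvest(hits))
```

The stricter annotation rule in the same paragraph (95% identity, 80% overlap) could reuse the same shape of filter with different thresholds.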

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
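
The score bands quoted above can be written down as a small lookup. This is only an illustrative sketch: the function name `interpret_ssap` and the wording of the returned labels are assumptions made for these notes, and the cut-offs are the rough guide values from the text, not hard rules used by CATH.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough interpretation quoted in the notes."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        # Scores above 80 are generally seen for homologous proteins.
        return "likely homologous"
    if score > 70:
        # Scores above 70 suggest the same fold (Topology level); without
        # sequence or functional similarity this may be convergence.
        return "similar fold"
    return "no clear structural relationship"


print(interpret_ssap(85.2))
print(interpret_ssap(73.0))
```

A score in the 70 to 80 band should still be checked against sequence and functional evidence before any evolutionary relationship is claimed.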

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
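
The grouping rule in the last sentence can be sketched as a simple predicate. The function name and parameters below are hypothetical, and the text does not define "high structural similarity" at this point; the sketch assumes an SSAP score of 80 or more, which is the value used for clear homologues later in these notes.

```python
def same_homologous_family(seq_identity: float, ssap_score: float,
                           high_ssap: float = 80.0) -> bool:
    """Return True if two domains meet the homologous-family criteria above.

    seq_identity: percentage sequence identity (0-100).
    ssap_score:   SSAP structural similarity score (0-100).
    high_ssap:    ASSUMED cut-off for "high structural similarity".
    """
    if seq_identity >= 35:
        # Significant sequence similarity is sufficient on its own.
        return True
    # Otherwise require high structural similarity plus some sequence similarity.
    return ssap_score >= high_ssap and seq_identity >= 20


print(same_homologous_family(seq_identity=40, ssap_score=60))  # sequence alone
print(same_homologous_family(seq_identity=25, ssap_score=85))  # structure + sequence
print(same_homologous_family(seq_identity=15, ssap_score=90))  # too little sequence identity
```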

The CATH database of protein structures contains approximately 18,000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are how many more folds we need to determine before we have the complete set, and how confident we can be in assigning function between proteins having similar structures.

In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
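
As a quick consistency check, the percentages quoted in the two paragraphs above can be recomputed from the counts given in the text. The variable names are illustrative only; the count of novel folds is not stated directly in the notes and is derived here as the remainder.

```python
new_domains = 2159   # domains classified June 1997 to March 1998
new_sequences = 443  # those not clearly homologous at the sequence level

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not quoted in the text: whatever is left must be the novel folds.
counts["novel folds"] = new_sequences - sum(counts.values())

percent = {name: round(100 * n / new_sequences) for name, n in counts.items()}
clearly_homologous = round(100 * (new_domains - new_sequences) / new_domains)
assigned_by_structure = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / new_sequences
)

print(counts["novel folds"], percent)
print(clearly_homologous, assigned_by_structure)
```

The recomputed values agree with the quoted 79%, 45%, 38%, 9% and 8%, and the combined share of clear and probable homologues matches the 83% figure used later in the notes.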

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
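
The comparison of "the first three EC identifiers" can be illustrated with a short sketch. EC numbers have four dot-separated levels (for example 3.4.21.62); the helper names below are hypothetical and the example values are chosen only to show the comparison, not taken from the CATH analysis.

```python
from typing import Iterable, Tuple


def ec_prefix(ec_number: str, levels: int = 3) -> Tuple[str, ...]:
    """Return the first `levels` fields of a four-field EC number."""
    fields = ec_number.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return tuple(fields[:levels])


def share_first_three_levels(ec_numbers: Iterable[str]) -> bool:
    """True if every EC number in a family agrees at the first three levels."""
    return len({ec_prefix(ec) for ec in ec_numbers}) == 1


# Two serine endopeptidases: same class, subclass and sub-subclass.
print(share_first_three_levels(["3.4.21.62", "3.4.21.4"]))
# An oxidoreductase and a peptidase: different at the first level.
print(share_first_three_levels(["1.1.1.1", "3.4.21.4"]))
```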

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only some protein structures are included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing through the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
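
The distance criterion above can be sketched in a few lines. This is not JAIL's own code: the function names, the `(residue_id, (x, y, z))` atom representation and the toy coordinates are assumptions made for illustration, and the brute-force loop is only suitable for small examples.

```python
from math import dist


def interface_residues(atoms_a, atoms_b, cutoff=4.5):
    """Return ids of residues in A with at least one atom within `cutoff`
    Angstrom of any atom in B. Atoms are (residue_id, (x, y, z)) pairs."""
    hits = set()
    for res_id, xyz in atoms_a:
        if res_id in hits:
            continue  # residue already counted as part of the interface
        if any(dist(xyz, other_xyz) <= cutoff for _, other_xyz in atoms_b):
            hits.add(res_id)
    return hits


def large_enough(residues, minimum=5):
    """Each residue contributes one C-alpha atom, so count residues."""
    return len(residues) >= minimum


# Toy coordinates: residue 1 has an atom 4.0 A from chain B, residue 2 is 6.0 A away.
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (1.5, 0.0, 0.0)), (2, (10.0, 0.0, 0.0))]
chain_b = [(101, (4.0, 0.0, 0.0))]

part_a = interface_residues(chain_a, chain_b)
print(part_a, large_enough(part_a))
```

For real structures a spatial index (for example a k-d tree) would normally replace the nested loop, since the all-against-all comparison grows with the product of the two atom counts.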

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

How are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

A full-text keyword search and a SCOP text search are also available.

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features. (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo

WHAT'S NEW

We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.

GENERAL INTRODUCTION

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to be solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version ...

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains ...

Significant changes have been implemented in the CATH classification protocol toachieve a more highly automated system A seamless flow of structures between theconstituent programs has been achieved by building a pipeline which integrates webservices for each major comparison stage in the classification (see Figure 3)Secondly completely automatic decisions are now being made for new proteinchains with close relatives already assigned in the CATH database There are twosituations that preclude the CATH update process from being fully automated Werely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds andhomologues (HomCheck stage) These two manual stages will remain an integralpart of the system (Figure 3)


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.


A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
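The nine-level code described above can be pictured as nine numbers, one per level. The following is only a sketch: it assumes the code is written as dot-separated integers in C.A.T.H.S.O.L.I.D order, and the function names and the example code (in particular its five trailing SOLID numbers) are hypothetical, chosen for illustration.

```python
# Level letters in hierarchy order, as named in the text.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code):
    """Split a dot-separated nine-level code into a {level: number} mapping.

    Assumption: the code is nine integers joined by dots.
    """
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def superfamily(code):
    """Return the C.A.T.H prefix, i.e. the homologous superfamily."""
    return ".".join(code.split(".")[:4])


# Hypothetical example code; the SOLID part is made up for illustration.
example = "3.40.50.200.1.1.1.1.1"
levels = parse_cathsolid(example)
print(levels["H"], superfamily(example))
```

Two domains share a superfamily when their first four numbers agree; the remaining five numbers only separate them by sequence identity cluster and by the final counter.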


Table 1. CATH version 3.0 statistics


DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
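The ±15 residue criterion used in the benchmark can be sketched as a simple comparison of predicted and manually assigned boundaries. This is an illustration only: it assumes each domain is a single continuous (start, end) segment, which the text does not state, and the function name and example coordinates are hypothetical.

```python
TOLERANCE = 15  # residues, the +/-15 window quoted in the benchmark


def boundaries_agree(predicted, reference, tolerance=TOLERANCE):
    """Return True if both assignments have the same number of domains and
    every start and end position differs by at most `tolerance` residues.

    Each assignment is a list of (start, end) residue numbers, one per domain.
    Assumption: domains are continuous segments and are paired by sorted order.
    """
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in pairs
    )


# Hypothetical two-domain chain: manual curation versus two predictions.
manual = [(1, 110), (111, 250)]
print(boundaries_agree([(1, 120), (121, 250)], manual))  # 10 residues off
print(boundaries_agree([(1, 85), (86, 250)], manual))    # 25 residues off
```

The first prediction falls inside the window and the second does not, which is the distinction behind the 78% figure.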

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
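The decision between automatic chopping and manual review can be sketched as a threshold check. Only the four criteria quoted in the text are encoded here; the text ends its list with "etc.", so the real protocol applies further checks that are not shown. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # angstroms
    longest_end_extension: int    # residues


def passes_autochop(result):
    """Apply the four thresholds quoted in the text.

    Assumption: any criteria hidden behind the text's "etc." are ignored.
    """
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def route(result):
    """Send a chain to automatic chopping or to manual curation."""
    return "AutoChop" if passes_autochop(result) else "manual (DomChop)"


# Hypothetical inheritance results.
print(route(Inheritance(92.0, 95.0, 1.8, 3)))
print(route(Inheritance(85.0, 72.0, 2.5, 3)))
```

In the second case the sequence identity falls below the threshold, so the inherited boundaries would only be offered as supporting information to a curator.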


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
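A distribution of this kind can be tabulated from a list giving the number of domains in each chain. The sketch below uses a small made-up list rather than CATH data, and the function name is hypothetical.

```python
from collections import Counter


def domain_count_distribution(domains_per_chain):
    """Map number-of-domains to the percentage of chains with that many."""
    counts = Counter(domains_per_chain)
    total = len(domains_per_chain)
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


# Toy input: one entry per chain, not real CATH data.
toy_chains = [1, 1, 1, 2, 1, 3, 2, 1]
for n_domains, percent in domain_count_distribution(toy_chains).items():
    print(f"{n_domains}-domain chains: {percent:.1f}%")
```

Applied to the full set of classified chains, the same tabulation would give the single-domain and two-domain percentages quoted above.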

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.


For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
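The hit selection described above amounts to two filters, one on the E-value and one on how much of the CATH domain the match covers. The sketch below is an assumption-laden illustration: the E-value cutoff is passed in rather than fixed, because the extracted text has lost its decimal point and the value used here is only a reading of it; overlap is taken as aligned residues divided by domain length; and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the CATH domain covered by the hit
    domain_length: int      # length of the CATH domain


def is_homologue(hit, max_e_value, min_overlap=0.60):
    """Keep a hit if its E-value is below the cutoff and it covers at least
    `min_overlap` of the CATH domain.

    Assumption: overlap = aligned_residues / domain_length.
    """
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


E_VALUE_CUTOFF = 0.01  # as read from the garbled text; adjust if it differs

# Hypothetical hits against a 120-residue domain.
hits = [
    Hit("seqA", 1e-5, 90, 120),   # significant, 75% overlap
    Hit("seqB", 1e-5, 50, 120),   # significant, but only about 42% overlap
    Hit("seqC", 0.5, 120, 120),   # full overlap, but not significant
]
kept = [h.sequence_id for h in hits if is_homologue(h, E_VALUE_CUTOFF)]
print(kept)
```

The stricter annotation step (95% identity, 80% overlap) follows the same pattern with sequence identity in place of the E-value.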

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
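The score ranges quoted above can be read as a rough lookup. This sketch is a reading aid rather than the classification rule used by CATH: the text says scores are "generally" in these ranges, the treatment of the exact boundary values 80 and 70 is an assumption, and the function name and labels are hypothetical.

```python
def interpret_ssap(score):
    """Give a rough reading of an SSAP score on the 0-100 scale.

    Assumption: scores of exactly 80 or 70 are placed in the higher band.
    """
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score >= 80:
        return "typical of homologous proteins"
    if score >= 70:
        return "typical of similar folds (Topology level)"
    return "no clear structural relationship"


for s in (100, 85, 75, 50):
    print(s, interpret_ssap(s))
```

A score in the fold-level band alone does not establish common ancestry; as the text notes, sequence or functional evidence is needed to separate homologues from convergent folds.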

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are how many more folds we need to determine before we have the complete set, and how confident we can be in assigning function between proteins having similar structures.

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
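The percentages in this breakdown can be checked directly from the counts given in the text. In the sketch below, the number of novel folds (35) and the number of clearly homologous domains (1716) are not stated in the text; they are derived here by subtraction, and the helper name is hypothetical.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


new_domains = 2159          # classified June 1997 to March 1998
new_sequences = 443         # those not clearly homologous by sequence
clear_homologues = 199
probable_homologues = 169
analogues = 40

# Derived by subtraction, not quoted in the text.
novel_folds = new_sequences - (clear_homologues + probable_homologues + analogues)
clearly_homologous_by_sequence = new_domains - new_sequences

print(percent(clearly_homologous_by_sequence, new_domains))  # share of all new domains
print(percent(clear_homologues, new_sequences))
print(percent(probable_homologues, new_sequences))
print(percent(analogues, new_sequences))
print(percent(novel_folds, new_sequences))
print(percent(clear_homologues + probable_homologues, new_sequences))
```

The last line combines clear and probable homologues, which gives the share of 'new' sequences that structure alone could assign to an existing family.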


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
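Comparing the "first three EC identifiers" means comparing the leading fields of the four-field EC number. The sketch below assumes EC numbers are given as dot-separated strings and treats a "-" field as unassigned; the function names are hypothetical and the EC strings are used only as example input.

```python
def shared_ec_levels(ec_a, ec_b):
    """Count how many leading EC fields two EC numbers have in common.

    Assumption: a "-" field means "unassigned" and never counts as a match.
    """
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        shared += 1
    return shared


def family_conserved(ec_numbers, levels=3):
    """True if every member shares at least `levels` leading EC fields."""
    first = ec_numbers[0]
    return all(shared_ec_levels(first, ec) >= levels for ec in ec_numbers[1:])


print(shared_ec_levels("3.4.21.62", "3.4.21.64"))    # differ only in field 4
print(shared_ec_levels("3.4.21.62", "3.1.1.3"))      # differ from field 2
print(family_conserved(["3.4.21.62", "3.4.21.64"]))
print(family_conserved(["3.4.21.62", "3.1.1.3"]))
```

Setting `levels=4` gives the stricter "single EC code" test used for the 95% and 98% figures.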

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i The structural data allow recognition of more distant homologues compared withsequence data mdash in our analysis 83 of structures with novel sequences could be assigned ashomologues in this way (note that such assignment of function is again subject to the caveats

imposed by gene recruitmentlsquo discussed above)

ii The structural data allows detailed inspection of the functional site mdash to suggest if and

how the function may have evolved For example if an enzyme has evolved to act on a

different substrate the binding site may reveal or at least suggest possible changes in the

substrate

iii For the superfolds similarity of structure does not necessarily mean similarity of

function However the active sitebinding sites are often conserved eg in the TIM barrel or

Rossmann fold structures the ligand always binds at the same end of the barrel or sheet

iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For

example enzymes can often be identified by the presence of a major cleft which also locates

the active site (18) Similarly critical surface patches which are used for molecular

recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this

will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
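The percentages in this summary are nested (a share of a share), so it may help to express them as fractions of the whole genome. The following is only an arithmetic sketch: the ranges are the ones quoted above, while the function name and the idea of multiplying the range endpoints are our own illustration, not part of the original analysis.

```python
# Ranges quoted in the text, written as (low, high) fractions.
NO_SEQUENCE_MATCH = (0.54, 0.70)     # sequences with no obvious sequence match in the PDB
STRUCTURE_HOMOLOGUE = (0.80, 0.90)   # of those, share recognisable from structure alone
NOVEL_FOLD = (0.10, 0.20)            # of those, share expected to be novel folds


def genome_wide(outer, inner):
    """Multiply two (low, high) ranges to get a rough genome-wide range."""
    return (outer[0] * inner[0], outer[1] * inner[1])


recognisable = genome_wide(NO_SEQUENCE_MATCH, STRUCTURE_HOMOLOGUE)
novel = genome_wide(NO_SEQUENCE_MATCH, NOVEL_FOLD)

print(f"recognisable only via structure: {recognisable[0]:.1%} to {recognisable[1]:.1%}")
print(f"expected novel folds:            {novel[0]:.1%} to {novel[1]:.1%}")
```

Under these assumptions, roughly two fifths to three fifths of a genome would gain a family assignment only through structural data, and only a small fraction would need ab initio methods.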

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank (PDB). Nevertheless, it is essential to analyse the interacting parts of proteins in order to understand the process of protein-protein docking.

To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid lies within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, each part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
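The distance rule above can be sketched in a few lines. This is a minimal illustration, not JAIL's actual implementation: it reads the cutoff in the definition as 4.5 Å, the atom layout `(residue_id, (x, y, z))` and both function names are hypothetical, and the "at least 5 C-alpha atoms" rule is interpreted as "at least 5 residues on that side", since each residue has one C-alpha atom.

```python
from math import dist

CUTOFF_ANGSTROM = 4.5   # distance cutoff, as read from the definition above
MIN_CA_ATOMS = 5        # minimum C-alpha atoms per interface side (protein chains)


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF_ANGSTROM):
    """Return residue ids of partner A that have at least one atom
    within `cutoff` of any atom of partner B.

    Each atom is a (residue_id, (x, y, z)) tuple (hypothetical layout).
    """
    coords_b = [xyz for _, xyz in atoms_b]
    hits = set()
    for residue_id, xyz in atoms_a:
        if residue_id in hits:
            continue  # the whole residue is already part of the interface
        if any(dist(xyz, other) <= cutoff for other in coords_b):
            hits.add(residue_id)
    return sorted(hits)


def is_valid_side(residue_ids, minimum=MIN_CA_ATOMS):
    """One C-alpha per residue, so the 5 C-alpha rule becomes a residue count."""
    return len(residue_ids) >= minimum


# Toy coordinates: residue 1 of chain A is close to chain B, residue 2 is not.
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (1.0, 0.0, 0.0)), (2, (10.0, 0.0, 0.0))]
chain_b = [(7, (0.0, 0.0, 4.0))]
print(interface_residues(chain_a, chain_b))
```

The all-against-all loop is quadratic in the number of atoms; a real pipeline over about 180,000 interfaces would more likely use a spatial index, which is omitted here for brevity.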

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

How are redundant interfaces excluded in the download section?

Redundancy is removed in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid

position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

The JAIL search form accepts the following queries:

PDB ID (e.g. 1aay)

SCOP ID (e.g. d1az0a_)

EC number (e.g. 3.1.1)

Accession number (e.g. P03697)

Protein name (e.g. capsid protein)

The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none. A full-text keyword search and a SCOP text keyword search are also available, and results can be limited to interfaces with a given location (intra, inter, or don't care).

MMDB

Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel, or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and their subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.

Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version

Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains

Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).

Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
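To make the nine-level scheme concrete, here is a small sketch that splits a dotted CATHsolid code into named levels. It assumes the code is written as nine dot-separated integers in the order C.A.T.H.S.O.L.I.D; the function name, the level labels and the example code are hypothetical and are not taken from the CATH website.

```python
# Level names in hierarchy order; S, O, L, I correspond to clustering at
# 35, 60, 95 and 100% sequence identity, and D is a counter within I.
LEVELS = (
    "Class",
    "Architecture",
    "Topology",
    "Homologous superfamily",
    "S (35% identity)",
    "O (60% identity)",
    "L (95% identity)",
    "I (100% identity)",
    "D (domain counter)",
)


def parse_cathsolid(code):
    """Split a dotted nine-level code into a {level name: number} mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


# Hypothetical identifier, used only to show the layout.
example = parse_cathsolid("3.40.50.720.1.2.3.4.5")
for level, number in example.items():
    print(f"{level:24s} {number}")
```

Two domains share a superfamily when their first four numbers agree; each further matching level means they also fall in the same sequence cluster at that identity threshold.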

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database, to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
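The boundary criterion used in this benchmark (predicted boundaries within ±15 residues of the manually validated ones) can be sketched as follows. This is an illustration only: the representation of a domain as a list of `(start, end)` segments, the requirement that both assignments have the same number of segments, and the function name are our assumptions, not the benchmark's actual code.

```python
TOLERANCE = 15  # residues, as quoted in the benchmark above


def boundaries_agree(predicted, reference, tolerance=TOLERANCE):
    """Compare two domain assignments given as lists of (start, end) segments.

    Returns True when both have the same number of segments and every
    start and end position differs by at most `tolerance` residues.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )


print(boundaries_agree([(1, 120)], [(5, 130)]))   # both ends within 15 residues
print(boundaries_agree([(1, 120)], [(1, 140)]))   # end differs by 20 residues
```

Counting the share of benchmark domains for which such a check succeeds would give a figure comparable in spirit to the 78% reported above.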

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently

being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.) then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
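The AutoChop decision can be pictured as a simple conjunction of thresholds. The sketch below reads the criteria in the text as SSAP score ≥ 80, sequence identity > 80%, RMSD ≤ 6.0 Å and longest end extension ≤ 10 residues; the text's "etc." means further criteria exist that are not listed, so this is deliberately incomplete. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Summary of boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # Angstrom
    longest_end_extension: int    # residues


def passes_autochop(result):
    """Apply only the four criteria named in the text (others are unspecified)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


close_relative = Inheritance(ssap_score=92.5, sequence_identity=97.0,
                             rmsd=1.2, longest_end_extension=3)
borderline = Inheritance(ssap_score=85.0, sequence_identity=80.0,
                         rmsd=2.0, longest_end_extension=4)

for candidate in (close_relative, borderline):
    decision = "AutoChop" if passes_autochop(candidate) else "manual DomChop support"
    print(decision)
```

Note that the identity criterion is strict (greater than 80%), so a chain at exactly 80% identity falls through to manual curation in this reading.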

NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring

scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.

NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
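The two-part acceptance rule for HMM hits described above can be written as a small filter. This is a sketch under stated assumptions: the thresholds are read from the text as an E-value below 0.01 and an overlap of at least 60% of the CATH domain, the overlap is taken to mean aligned residues divided by domain length, and the record layout, identifiers and function names are hypothetical.

```python
MAX_E_VALUE = 0.01    # E-value threshold as read from the text (assumption)
MIN_OVERLAP = 0.60    # fraction of the CATH domain covered by the hit


def overlap_fraction(aligned_residues, domain_length):
    """Share of the CATH domain covered by the aligned region."""
    return aligned_residues / domain_length


def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e=MAX_E_VALUE, min_overlap=MIN_OVERLAP):
    """Accept a hit only if it is both significant and long enough."""
    return (
        e_value < max_e
        and overlap_fraction(aligned_residues, domain_length) >= min_overlap
    )


# Hypothetical scan results: (sequence id, E-value, aligned residues, domain length)
hits = [
    ("seqA", 1e-20, 140, 150),   # significant and nearly full-length
    ("seqB", 1e-20, 60, 150),    # significant but covers only 40% of the domain
    ("seqC", 0.5, 150, 150),     # full-length but not significant
]
accepted = [name for name, e, aligned, length in hits
            if is_homologous_hit(e, aligned, length)]
print(accepted)
```

Requiring both conditions guards against short, highly significant matches to a conserved motif being counted as whole-domain relatives.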

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
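As a rough reading aid, the SSAP score ranges quoted above can be written as a lookup. The function name and the label strings are illustrative assumptions; the text says only that such scores are "generally" observed, so the cut-offs are rules of thumb rather than strict boundaries.

```python
def ssap_interpretation(score):
    """Map an SSAP score (0-100) to the rough categories quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "typical of a shared fold (Topology level)"
    return "no clear structural relationship"


for example_score in (100, 85.2, 74.0, 55.0):
    print(example_score, "->", ssap_interpretation(example_score))
```

A score in the 70 to 80 band does not by itself imply common ancestry, as the text notes: without sequence or functional similarity it may reflect convergence on a limited set of folds.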

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995 information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
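The two alternative criteria for placing a pair of domains in the same homologous family can be sketched as a single predicate. The function and parameter names are hypothetical. "High structural similarity" is taken here to mean an SSAP score of at least 80, which is the threshold used elsewhere in these notes for clear homologues; treat that as an assumption of this sketch.

```python
def same_homologous_family(seq_identity_pct, ssap_score,
                           high_identity=35.0, low_identity=20.0,
                           ssap_cutoff=80.0):
    """Return True if a pair of domains meets either grouping criterion.

    Criterion 1: significant sequence similarity (>= 35% identity).
    Criterion 2: high structural similarity (assumed SSAP >= 80)
                 together with some sequence similarity (>= 20% identity).
    """
    if seq_identity_pct >= high_identity:
        return True
    return ssap_score >= ssap_cutoff and seq_identity_pct >= low_identity


# Made-up pairs: (sequence identity in %, SSAP score)
print(same_homologous_family(42.0, 70.0))  # True, by sequence alone
print(same_homologous_family(25.0, 86.0))  # True, by structure plus sequence
print(same_homologous_family(25.0, 72.0))  # False, structure not similar enough
print(same_homologous_family(12.0, 90.0))  # False, identity below 20%
```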

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id: 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
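The percentages quoted for the newly classified domains can be checked with a few lines of arithmetic. The counts 2159, 443, 199, 169 and 40 come from the text; the number of novel folds (35) is not stated explicitly and is derived here as the remainder, so treat it as an inference.

```python
total_new_domains = 2159   # domains classified June 1997 to March 1998
new_sequences = 443        # domains without a clear (>=30% identity) relative
clear_homologues = 199
probable_homologues = 169
analogues = 40

# Derived, not stated in the text: whatever is left must be novel folds.
novel_folds = new_sequences - (clear_homologues + probable_homologues + analogues)


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print("clearly homologous by sequence:", pct(total_new_domains - new_sequences, total_new_domains), "%")
print("clear homologues:", pct(clear_homologues, new_sequences), "%")
print("probable homologues:", pct(probable_homologues, new_sequences), "%")
print("analogues:", pct(analogues, new_sequences), "%")
print("novel folds:", novel_folds, "=", pct(novel_folds, new_sequences), "%")
print("all homologues:", pct(clear_homologues + probable_homologues, new_sequences), "%")
```

The last line gives the 83% figure used later in these notes for structures with novel sequences that could still be assigned as homologues.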


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
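The comparison of EC identifiers at a given depth can be sketched as follows. EC numbers have four dot-separated fields, and "the first three EC identifiers were the same" is read here as agreement on the first three fields. The helper names and the example family are illustrative, not taken from the CATH data.

```python
def ec_prefix(ec_number, depth=3):
    """Return the first `depth` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:depth])


def family_is_conserved(ec_numbers, depth=3):
    """True if all EC numbers in a family agree on their first `depth` fields."""
    return len({ec_prefix(ec, depth) for ec in ec_numbers}) == 1


# Illustrative family: two members differing only in the fourth EC field.
family = ["3.4.21.62", "3.4.21.64"]
print(family_is_conserved(family, depth=3))  # same down to the third field
print(family_is_conserved(family, depth=4))  # not a single full EC code
```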

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id: 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID: 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
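The distance criterion for interface residues can be sketched with plain coordinate tuples. This is a minimal illustration, not the JAIL implementation: the data layout (a dict from residue id to a list of atom coordinates), the function names and the toy coordinates are all assumptions, and "at least 5 C-alpha atoms" is read as "at least 5 residues" on one side of the interface.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=4.5):
    """Residues of chain_a with at least one atom within `cutoff` Angstrom
    of any atom of chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for residue_id, atoms in chain_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.append(residue_id)  # the complete residue is kept
    return selected


def is_valid_interface_part(residues, min_residues=5):
    """Size filter for protein chains (one C-alpha atom per residue assumed)."""
    return len(residues) >= min_residues


# Toy coordinates, not taken from a real PDB entry.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(4.0, 0.0, 0.0)]}

part_a = interface_residues(chain_a, chain_b)
print(part_a, is_valid_interface_part(part_a))
```

The all-against-all distance loop is quadratic in the number of atoms; for whole PDB entries a spatial grid or k-d tree would normally replace it.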

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the Cd-hit program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
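Why a 50% identity limit yields a more diverse (and smaller) set than a 95% limit can be seen with a toy greedy selection of representatives. This is a simplified sketch in the spirit of greedy incremental clustering, not the Cd-hit algorithm itself: the identity measure is a naive position-by-position comparison without alignment, and the sequences are invented.

```python
def naive_identity(seq_a, seq_b):
    """Fraction of identical positions over the shorter sequence (no alignment)."""
    shorter = min(len(seq_a), len(seq_b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / shorter


def greedy_representatives(sequences, max_identity):
    """Keep a sequence only if it is below `max_identity` to every kept one.

    Sequences are processed longest first, so the longest member of each
    group of similar sequences becomes its representative.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(naive_identity(seq, rep) < max_identity for rep in representatives):
            representatives.append(seq)
    return representatives


# Invented toy sequences.
toy = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFWWWWW", "MNPQRSTVWY"]

at_95 = greedy_representatives(toy, 0.95)
at_50 = greedy_representatives(toy, 0.50)
print(len(at_95), at_95)
print(len(at_50), at_50)
```

With the stricter 50% limit, near-duplicates and moderately similar sequences are folded into one representative, so the remaining set is smaller but its members differ more from each other.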

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None

Full-text search: Keyword
SCOP text search: Keyword

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes [1][2][3][4][5]. The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family-level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.

In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.

A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID

In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
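The nine-level code described above can be illustrated with a short Python sketch. It assumes the identifier is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D; the function names and the example identifier (in particular its last five numbers) are made up for illustration and are not taken from CATH itself.

```python
# Hypothetical helper for splitting a CATHSOLID code into its nine levels.
# Assumption: the code is nine dot-separated integers, ordered C.A.T.H.S.O.L.I.D.
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily",
          "S", "O", "L", "I", "D")

# Sequence identity (%) used for clustering at each of the S, O, L and I levels.
SOLID_IDENTITY = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(identifier: str) -> dict:
    """Return a mapping from level name to the number at that level."""
    parts = identifier.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def superfamily_code(identifier: str) -> str:
    """Return the first four levels (C.A.T.H), i.e. the superfamily."""
    return ".".join(identifier.split(".")[:4])


if __name__ == "__main__":
    example = "3.40.50.200.1.1.1.1.1"  # illustrative value only
    print(parse_cathsolid(example))
    print(superfamily_code(example))
```

Two domains with the same first four numbers belong to the same homologous superfamily; agreement at further levels means they also fall in the same sequence cluster at the identity threshold listed for that level.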

Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
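The boundary criterion used in the benchmark (predicted boundaries within ±15 residues of the manually assigned ones) can be sketched as follows. The representation is an assumption: each chain's boundaries are given as a list of residue positions, and the function name is hypothetical. The exact scoring used in the benchmark is not described in the text.

```python
def boundaries_agree(predicted, manual, tolerance=15):
    """Return True if every predicted boundary lies within `tolerance`
    residues of the corresponding manually assigned boundary.

    Assumption: boundaries are residue positions, and the two lists are
    matched in sorted order. A different number of boundaries counts
    as disagreement.
    """
    if len(predicted) != len(manual):
        return False
    return all(abs(p - m) <= tolerance
               for p, m in zip(sorted(predicted), sorted(manual)))


if __name__ == "__main__":
    # A three-domain chain: both boundaries are within 15 residues.
    print(boundaries_agree([120, 260], [110, 270]))
    # A two-domain chain whose boundary is off by 20 residues.
    print(boundaries_agree([120], [140]))
```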

Since domain boundary assignment of remote homologues is one of the most time consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
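The decision between automatic chopping and manual assignment can be written as a simple check. This is only a sketch: it assumes the thresholds read as SSAP score ≥ 80, sequence identity > 80%, RMSD ≤ 6.0 Å and longest end extension ≤ 10 residues, and it covers only these four named criteria (the text says "etc.", so the real protocol applies others that are not listed). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritanceResult:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float             # SSAP structural similarity score (0-100)
    sequence_identity: float      # percent identity to the chopped chain
    rmsd: float                   # RMSD of the alignment, in angstroms
    longest_end_extension: int    # residues


def passes_autochop(result: InheritanceResult) -> bool:
    """Apply the four criteria named in the text (others are not listed)."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)


def route(result: InheritanceResult) -> str:
    """Send the chain to AutoChop, or pass it on as support for manual work."""
    return "AutoChop" if passes_autochop(result) else "manual assignment"


if __name__ == "__main__":
    print(route(InheritanceResult(92.0, 95.0, 1.2, 3)))
    print(route(InheritanceResult(85.0, 70.0, 2.5, 3)))
```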


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.

OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues in length.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability are presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
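The hit-filtering step described above amounts to two threshold tests. The sketch below assumes that the overlap is measured as the fraction of the CATH domain covered by the hit and that the cut-offs read as an E-value below 0.01 and an overlap of at least 60%; both are passed as parameters so other readings of the thresholds can be tried. The function names are hypothetical and do not come from the CATH software.

```python
def overlap_fraction(hit_start: int, hit_end: int, domain_length: int) -> float:
    """Fraction of the domain covered by a hit (inclusive residue positions)."""
    return (hit_end - hit_start + 1) / domain_length


def is_homologous_hit(e_value: float, hit_start: int, hit_end: int,
                      domain_length: int,
                      max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Keep a hit only if it is both significant and long enough.

    Assumption: default thresholds follow one reading of the text.
    """
    covered = overlap_fraction(hit_start, hit_end, domain_length)
    return e_value < max_e_value and covered >= min_overlap


if __name__ == "__main__":
    # Significant hit covering 90 of 100 domain residues: kept.
    print(is_homologous_hit(1e-5, 1, 90, 100))
    # Significant hit covering only half of the domain: rejected.
    print(is_homologous_hit(1e-5, 1, 50, 100))
```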

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
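As a rough illustration of how the SSAP score ranges quoted above might be read, the sketch below maps a score to a qualitative label. The cut-offs (100, above 80, above 70) are those given in the text; the labels and the function name are illustrative only, and the text itself stresses that the ranges hold "generally", not strictly.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to a qualitative label.

    The labels are illustrative: a score above 70 without sequence or
    functional similarity may reflect convergent evolution and is not
    proof of homology.
    """
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold"
    return "no clear structural relationship"


if __name__ == "__main__":
    for s in (100, 86.5, 74.0, 52.3):
        print(s, interpret_ssap(s))
```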

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level). Recognising more distant structural similarity, with no accompanying sequence or function similarity, gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures.

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
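The breakdown of the 443 structures can be checked with a few lines of arithmetic. The counts for clear homologues, probable homologues and analogues are those quoted in the text; the number of novel folds is not stated directly and is derived here as the remainder, which is an inference rather than a figure from the source.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
total_new = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Novel folds are whatever is left over (derived, not quoted in the text).
counts["novel folds"] = total_new - sum(counts.values())

# Percentages of the 443 structures, rounded to whole numbers.
percentages = {name: round(100 * n / total_new) for name, n in counts.items()}

# Share recognisable as homologues (clear plus probable) from structure.
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / total_new
)

if __name__ == "__main__":
    for name, n in counts.items():
        print(f"{name}: {n} ({percentages[name]}%)")
    print(f"homologues in total: {homologue_share}%")
```

The derived remainder of 35 structures corresponds to the 8% of novel folds, and the combined homologue share comes to 83% of the structures with new sequences.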


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
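The comparison of EC identifiers across a family can be sketched as follows. EC numbers have four dot-separated fields, so "the first three EC identifiers were the same" is taken here to mean agreement on the first three fields; the function names are hypothetical and the EC numbers in the example are only illustrative inputs.

```python
def ec_prefix(ec_number: str, levels: int = 3) -> tuple:
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec(ec_numbers, levels: int = 3) -> bool:
    """True if all members of a family agree on the first `levels` EC fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


if __name__ == "__main__":
    family = ["3.4.21.62", "3.4.21.64"]          # illustrative EC numbers
    print(family_shares_ec(family))              # agree on first three fields
    print(family_shares_ec(family, levels=4))    # differ on the fourth field
    print(family_shares_ec(["3.4.21.62", "1.1.1.1"]))
```

Calling the check with `levels=4` corresponds to the stricter "single EC code" criterion used in the same paragraph.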

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active sites/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.

[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank (PDB). Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a comfortable tool for browsing the interface library and analyzing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
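The distance rule for protein chains can be sketched in a few lines of Python. This is a minimal illustration and not JAIL's own code: the data layout (a chain as a dictionary mapping residue ids to atom-name/coordinate pairs) and both function names are assumptions made for the example. The cutoff (read here as 4.5 Å) and the minimum of 5 C-alpha atoms follow the definition given in the text.

```python
from math import dist

CUTOFF = 4.5   # Angstrom; distance cutoff as read from the definition in the text
MIN_CA = 5     # minimum number of C-alpha atoms per protein side

# Hypothetical layout: chain = {residue_id: {atom_name: (x, y, z)}}

def interface_residues(chain, partner, cutoff=CUTOFF):
    """Residues of `chain` with at least one atom within `cutoff` of any atom of `partner`."""
    partner_atoms = [xyz for atoms in partner.values() for xyz in atoms.values()]
    return sorted(
        res_id
        for res_id, atoms in chain.items()
        if any(dist(xyz, other) <= cutoff
               for xyz in atoms.values()
               for other in partner_atoms)
    )

def has_enough_ca(chain, residues, minimum=MIN_CA):
    """True if the interface side contains at least `minimum` C-alpha atoms."""
    return sum(1 for res_id in residues if "CA" in chain[res_id]) >= minimum
```

The all-against-all distance loop is quadratic in the number of atoms, which is acceptable for a single pair of chains; a production tool would more likely use a spatial grid or k-d tree.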

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

How are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
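To illustrate what sequence-based redundancy removal does, here is a small sketch of greedy incremental clustering, the general strategy used by CD-HIT-style tools: sequences are processed longest first, and each one either joins the first cluster whose representative it matches at or above the identity threshold, or starts a new cluster. The `identity` function is a crude stand-in based on Python's `difflib`, not CD-HIT's own short-word filter and alignment, and all names are hypothetical.

```python
from difflib import SequenceMatcher

def identity(a, b):
    """Crude stand-in for alignment identity: matched characters / shorter length."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / min(len(a), len(b))

def greedy_cluster(sequences, threshold=0.95):
    """Return a list of (representative, members) pairs; expects non-empty sequences."""
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)      # redundant with an existing representative
                break
        else:
            clusters.append((seq, [seq]))  # no match: seq becomes a new representative
    return clusters
```

Keeping only the representatives gives a non-redundant set. Lowering the threshold merges more sequences into each cluster, so fewer and more dissimilar representatives remain.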

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

[Figure: Database scheme]

Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types:
Domain-Domain (SCOP)
Chain-Chain
Protein-Nucleic
Biounit-Biounit
None

Full-text search and SCOP text search: by keyword

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Table 1. CATH version 3.0 statistics

DOMAIN BOUNDARY ASSIGNMENTS

We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory, to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.

Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
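As a small illustration of the boundary criterion used in the benchmark, the check below compares predicted and manually validated boundary positions with a residue tolerance. How the benchmark actually paired and scored boundaries is not stated in the text, so the pairing by sorted position, the function name and the example numbers are all assumptions.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding reference boundary (paired by sorted position)."""
    if len(predicted) != len(reference):
        return False  # a different number of boundaries means a different domain count
    return all(abs(p - r) <= tolerance
               for p, r in zip(sorted(predicted), sorted(reference)))

# Hypothetical example: a three-domain chain with two domain boundaries.
print(boundaries_agree([120, 251], [110, 260]))  # True
print(boundaries_agree([120, 251], [150, 260]))  # False
```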

Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
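The automatic-chopping decision can be pictured as a simple gate over the quoted criteria. The sketch below is only an illustration under stated assumptions: the record type and field names are hypothetical, the thresholds are my reading of the criteria quoted in the text, and the text's "etc." indicates that the real protocol applies further criteria that are not listed and therefore not modelled here.

```python
from dataclasses import dataclass

@dataclass
class InheritedChopping:
    """Hypothetical summary of boundaries inherited from one chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # Angstrom
    longest_end_extension: int    # residues

def can_autochop(result: InheritedChopping) -> bool:
    """Apply only the four criteria quoted in the text."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)

good = InheritedChopping(ssap_score=92.0, sequence_identity=95.0,
                         rmsd=1.2, longest_end_extension=3)
print(can_autochop(good))  # True
```

A result that fails any one test would fall back to manual boundary assignment, with the ChopClose output shown as supporting information.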


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
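To make the two validation figures concrete, the sketch below shows one common way such numbers are computed from classifier scores on a labelled validation set. The text does not define "error rate", so treating it as the fraction of non-homologous pairs scoring at or above the decision threshold is an assumption, as are the function name and the toy scores.

```python
def sensitivity_and_error_rate(homologous_scores, non_homologous_scores, threshold):
    """Fraction of homologous pairs accepted, and fraction of non-homologous
    pairs wrongly accepted, at a given score threshold."""
    accepted_true = sum(score >= threshold for score in homologous_scores)
    accepted_false = sum(score >= threshold for score in non_homologous_scores)
    return (accepted_true / len(homologous_scores),
            accepted_false / len(non_homologous_scores))

# Toy scores, invented for illustration only.
sens, err = sensitivity_and_error_rate([0.9, 0.8, 0.7, 0.2], [0.6, 0.1, 0.1, 0.3], 0.5)
print(sens, err)  # 0.75 0.25
```

Raising the threshold lowers the error rate at the cost of sensitivity, so a statement such as "97% at an error rate below 4%" describes one point on that trade-off.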


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives, and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
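The hit-selection step amounts to a two-condition filter on each HMM match. The sketch below is illustrative only: the `Hit` record, its field names, and the definition of overlap as aligned residues divided by domain length are assumptions, and the default cutoffs are my reading of the thresholds in the text, so they are kept as parameters.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical summary of one HMM match against a CATH domain model."""
    sequence_id: str
    e_value: float
    aligned_residues: int
    domain_length: int

def is_homologue(hit: Hit, max_e_value: float = 0.01, min_overlap: float = 0.60) -> bool:
    """Keep a hit only if it is significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap

hits = [
    Hit("seqA", 1e-12, 140, 159),   # significant and well covered
    Hit("seqB", 1e-12, 60, 159),    # significant but covers too little of the domain
    Hit("seqC", 0.5, 150, 159),     # well covered but not significant
]
print([h.sequence_id for h in hits if is_homologue(h)])  # ['seqA']
```

The same shape of filter, with different cutoffs, would apply to the later annotation step that keeps only high-identity, high-overlap BLAST hits.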

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed, or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
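The score bands described here can be written as a small lookup. This is a reading aid rather than CATH's classification logic: the text says scores "generally" fall in these ranges, so the bands are rules of thumb, and the function name and labels are invented for the example.

```python
def ssap_band(score: float) -> str:
    """Map an SSAP score (0-100) to the rough similarity band described in the text."""
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural relationship"

for score in (100, 85, 75, 60):
    print(score, ssap_band(score))
```

A score in the fold-level band on its own does not establish common ancestry; as the text notes, without sequence or functional similarity it may reflect convergence.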

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
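The four levels map directly onto the dotted CATH identifier, such as 3.40.50.200 for the subtilisin family cited in these notes. The sketch below parses such an identifier and reports the deepest level two domains share. The function names are hypothetical, and the numeric class codes follow the usual CATH convention (1 mainly alpha, 2 mainly beta, 3 alpha-beta, 4 few secondary structures), which is assumed here rather than stated in the text.

```python
from typing import NamedTuple

# Assumed CATH class numbering (not spelled out in the notes).
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_family: int


def parse_cath_id(text: str) -> CathId:
    """Split a dotted CATH id such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected C.A.T.H with four numeric fields, got {text!r}")
    return CathId(*(int(p) for p in parts))


def deepest_shared_level(a: CathId, b: CathId) -> str:
    """Name the deepest level at which two ids agree, reading left to right."""
    names = ["none", "class", "architecture", "topology", "homologous family"]
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return names[depth]


subtilisin = parse_cath_id("3.40.50.200")
print(CLASS_NAMES[subtilisin.class_])
```

Comparing ids left to right mirrors the hierarchy: two domains can share a topology only if they already share class and architecture.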

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
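The grouping rule can be sketched as a two-branch test. The passage does not define "high structural similarity" numerically; the SSAP cutoff of 80 used below is borrowed from the "clear homologue" criterion given later in these notes and should be read as an assumption, as should the function name.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           ssap_cutoff: float = 80.0) -> bool:
    """Apply the two alternative criteria for a homologous family.

    Branch 1: significant sequence similarity (>= 35% identity).
    Branch 2: high structural similarity plus some sequence similarity
              (>= 20% identity); the SSAP cutoff is an assumed value.
    """
    if seq_identity_pct >= 35:
        return True
    return ssap_score >= ssap_cutoff and seq_identity_pct >= 20


print(same_homologous_family(seq_identity_pct=24, ssap_score=86))
```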

The CATH database of protein structures contains approximately 18,000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor) regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α−β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues, as although the sequence identity was below 20% they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
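The quoted percentages can be checked against the counts. The short calculation below takes the three stated counts out of 443 and treats the remainder as the novel folds; that the four categories are exhaustive is an assumption made here, since the text gives only a percentage for novel folds.

```python
total = 443  # structures with "new" sequences (Fig. 4b)

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: whatever is left over corresponds to the novel folds.
counts["novel folds"] = total - sum(counts.values())

percent = {name: round(100 * n / total) for name, n in counts.items()}
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / total
)

for name, n in counts.items():
    print(f"{name:20s} {n:4d}  {percent[name]:3d}%")
print(f"clear + probable homologues: {homologue_share}%")
```

The remainder of 35 structures rounds to the stated 8%, and clear plus probable homologues together give the 83% figure used later in the notes.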


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
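Comparing "the first three EC identifiers" amounts to comparing the leading fields of the dotted EC number. A minimal sketch follows; the function names are hypothetical and the EC numbers in the usage line are ordinary examples, not data from the analysis.

```python
from typing import Sequence


def ec_levels_shared(ec_a: str, ec_b: str) -> int:
    """Count how many leading EC fields two numbers have in common."""
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        # A '-' marks an unassigned field and cannot count as agreement.
        if a != b or a == "-":
            break
        shared += 1
    return shared


def family_shares_ec(ec_numbers: Sequence[str], levels: int = 3) -> bool:
    """True if every member agrees with the first on `levels` leading fields."""
    if not ec_numbers:
        raise ValueError("need at least one EC number")
    first = ec_numbers[0]
    return all(ec_levels_shared(first, ec) >= levels for ec in ec_numbers)


print(family_shares_ec(["3.4.21.62", "3.4.21.4"]))
```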

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

(i) The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

(ii) The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.

(iii) For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

(iv) Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
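To see what these ranges imply for a whole genome, the two sets of percentages can be multiplied. The calculation below pairs the low ends and the high ends of each range; that pairing is a simplification made for illustration and is not something the text specifies.

```python
# Ranges quoted in the summary, as (low, high) percentages.
no_sequence_match = (54, 70)        # sequences with no obvious match in the PDB
homologous_by_structure = (80, 90)  # of those, assignable using structure alone
novel_folds = (10, 20)              # of those, expected to be novel folds


def share_of_genome(outer, inner):
    """Percentage of all sequences, pairing low with low and high with high."""
    return tuple(o * i / 100 for o, i in zip(outer, inner))


rescued = share_of_genome(no_sequence_match, homologous_by_structure)
novel = share_of_genome(no_sequence_match, novel_folds)

print(f"assigned only via structure: {rescued[0]:.1f}-{rescued[1]:.1f}% of a genome")
print(f"novel folds:                 {novel[0]:.1f}-{novel[1]:.1f}% of a genome")
```

Under this reading, structural data would add assignments for roughly 43 to 63% of all sequences, leaving about 5 to 14% of a genome as genuinely novel folds.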

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a comfortable tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
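The distance criterion can be sketched in a few lines, reading the cutoff as 4.5 Å. This is an illustration under stated assumptions, not JAIL's implementation: atoms are given as hypothetical (residue id, xyz) pairs, every atom pair is compared by brute force, and the five C-alpha minimum is approximated by counting residues.

```python
import math
from typing import Iterable, Set, Tuple

# Hypothetical atom record: (residue identifier, (x, y, z) in Angstrom).
Atom = Tuple[str, Tuple[float, float, float]]


def interface_residues(side: Iterable[Atom],
                       partner: Iterable[Atom],
                       cutoff: float = 4.5) -> Set[str]:
    """Residues of `side` with at least one atom within `cutoff` of `partner`."""
    partner_xyz = [xyz for _, xyz in partner]
    return {
        residue
        for residue, xyz in side
        if any(math.dist(xyz, other) <= cutoff for other in partner_xyz)
    }


def is_valid_side(residues: Set[str], min_residues: int = 5) -> bool:
    """Approximate the 'at least 5 C-alpha atoms' rule by a residue count."""
    return len(residues) >= min_residues


chain_a = [("A:10", (0.0, 0.0, 0.0)), ("A:10", (1.0, 0.0, 0.0)), ("A:11", (10.0, 0.0, 0.0))]
chain_b = [("B:5", (4.0, 0.0, 0.0)), ("B:6", (30.0, 0.0, 0.0))]
print(interface_residues(chain_a, chain_b))
```

For real structures the all-against-all loop becomes slow, and a spatial index such as a k-d tree is the usual replacement; the criterion itself stays the same.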

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit

A full-text search and a SCOP text search by keyword are also available. Results can be restricted to interfaces having the following location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
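The address above is the interactive Entrez page. For scripted searches, NCBI also offers the E-utilities, which these notes do not mention; the sketch below only builds an ESearch URL for the structure database and sends nothing over the network. The helper name and the search term are hypothetical, and the endpoint and parameters should be checked against current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Assumed E-utilities endpoint; not taken from the notes.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term: str, retmax: int = 5) -> str:
    """Build an ESearch query URL against the Entrez structure database."""
    query = urlencode({"db": "structure", "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


print(structure_search_url("subtilisin"))
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a list of matching record identifiers; that step is left out so the example runs offline.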


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can

Submit sequences for SCOP classification

View domain organisation sequence alignments and protein sequence details

For each genome you can

Examine superfamily assignments phylogenetic trees domain organisation lists and

networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation models and the database dump are freely available for download to everyone


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).

For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.

Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
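The decision between automatic chopping and manual review can be sketched as a threshold check. Only the four criteria named in the text are encoded; the "etc." there indicates further criteria that are not listed and so are not modelled. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedBoundaries:
    """Quality measures for boundaries inherited from one chopped chain."""
    ssap_score: float
    seq_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues


def passes_autochop(result: InheritedBoundaries) -> bool:
    """Check the four thresholds quoted in the text (others are omitted)."""
    return (
        result.ssap_score >= 80
        and result.seq_identity_pct > 80
        and result.rmsd_angstrom <= 6.0
        and result.longest_end_extension <= 10
    )


def route(result: InheritedBoundaries) -> str:
    """AutoChop if all thresholds pass, otherwise hand over to a curator."""
    return "AutoChop" if passes_autochop(result) else "manual assignment (with ChopClose support)"


print(route(InheritedBoundaries(92.0, 95.0, 1.8, 3)))
```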


NEW HOMOLOGUE RECOGNITION METHODS

We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.

For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 30)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time relative to the number of new structures being deposited, and currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability are presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
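The hit-selection rule described above can be sketched in a few lines of Python. This is only an illustration: the record layout (`HmmHit`), the function name `is_homologue`, and the choice to measure overlap as the fraction of the CATH domain covered by the match are assumptions, and the thresholds are read here as an E-value below 0.01 and an overlap of at least 60%.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """Hypothetical record for one HMM match (not a real CATH data format)."""
    sequence_id: str
    evalue: float
    aligned_residues: int  # residues of the CATH domain covered by the match
    domain_length: int     # length of the CATH domain behind the HMM


def is_homologue(hit, max_evalue=0.01, min_overlap=0.60):
    """Accept a hit only if it is significant AND covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


hits = [
    HmmHit("seqA", 1e-20, 120, 150),  # significant, 80% overlap
    HmmHit("seqB", 1e-5, 60, 150),    # significant, but only 40% overlap
    HmmHit("seqC", 0.5, 140, 150),    # good overlap, but weak E-value
]
accepted = [h.sequence_id for h in hits if is_homologue(h)]
print(accepted)  # ['seqA']
```

Requiring both conditions means a short but highly significant match (such as a shared motif) is not mistaken for a whole-domain relative.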

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and to identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily is measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
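As a rough illustration of how these SSAP score ranges might be read, the sketch below maps a score to a qualitative label. The function name and the label strings are hypothetical, and the cutoffs are the rules of thumb quoted in the text ("generally above 80", "generally above 70"), not hard thresholds used by CATH itself.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to a qualitative label (illustrative only)."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural similarity"


for s in (100, 86.5, 74, 52):
    print(s, interpret_ssap(s))
```

A score in the 70 to 80 band says only that the folds are similar; without supporting sequence or functional evidence the relationship may be analogous rather than homologous.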

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families

(homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops/; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no

detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
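The breakdown of the 443 structures can be checked with a little arithmetic. The sketch below reads the bare numbers in brackets as percentages and derives the count of novel folds as the remainder; that count (35) is computed here and is not stated in the text, and the dictionary keys are only labels chosen for the example.

```python
total = 443  # structures with 'new' sequences (Fig. 4b)

# Counts quoted in the text for structures resembling a known fold.
counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# The novel folds are whatever remains (derived, not quoted).
counts["novel fold"] = total - sum(counts.values())

percentages = {name: round(100 * n / total) for name, n in counts.items()}
print(counts["novel fold"])  # 35
print(percentages)
# {'clear homologue': 45, 'probable homologue': 38, 'analogue': 9, 'novel fold': 8}
```

The rounded shares reproduce the quoted 45, 38, 9 and 8 per cent, and the two homologue categories together give 83%.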

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
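The test of whether the relatives in a family share the first three EC identifiers amounts to comparing EC number prefixes. A minimal sketch follows; the function names are hypothetical and the EC numbers in the examples are illustrative values chosen for the demonstration, not data from the analysis.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative families (made-up groupings).
conserved_family = ["3.4.21.62", "3.4.21.64", "3.4.21.66"]
diverged_family = ["1.1.1.1", "2.7.1.1"]

print(shares_ec_prefix(conserved_family))            # True
print(shares_ec_prefix(diverged_family))             # False
print(shares_ec_prefix(conserved_family, levels=4))  # False
```

Setting `levels=4` gives the stricter "single EC code" criterion, which is why a family can pass the three-field test while failing the four-field one.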

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.

For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this

will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
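To put the extrapolation in terms of a whole genome, the quoted ranges can be multiplied through. The sketch below is a back-of-the-envelope calculation, not a result from the paper: it assumes the ranges are percentages (54–70% of sequences unmatched, of which 80–90% are homologous by structure and 10–20% novel) and pairs the low ends and the high ends to obtain outer bounds. The helper name is hypothetical.

```python
def share_of_genome(unmatched_pct, fraction_pct):
    """Percentage of the whole genome: `fraction_pct` % of the unmatched share."""
    return unmatched_pct * fraction_pct / 100


unmatched = (54, 70)   # % of sequences with no obvious sequence match in the PDB
homologous = (80, 90)  # % of those found homologous using structure alone
novel = (10, 20)       # % of those expected to be novel folds

homologous_range = tuple(share_of_genome(u, h) for u, h in zip(unmatched, homologous))
novel_range = tuple(share_of_genome(u, n) for u, n in zip(unmatched, novel))

print(homologous_range)  # (43.2, 63.0)
print(novel_range)       # (5.4, 14.0)
```

On these assumptions, structural data would bring roughly 43 to 63% of a genome into known families, leaving about 5 to 14% as candidate novel folds.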

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive, non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
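The distance criterion can be illustrated with a small, self-contained Python sketch. It is not the JAIL implementation: the data layout (a dictionary from residue number to a list of atom coordinates), the function names, and the toy coordinates are all assumptions, the cutoff is taken to be 4.5 Å, and the "at least 5 C-alpha atoms" rule is approximated as at least five residues on each side, since each residue contributes one C-alpha.

```python
import math

CUTOFF = 4.5  # Ångström; assumed reading of the distance threshold


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residues of chain_a with at least one atom within `cutoff` of any atom of chain_b.

    Each chain is a dict mapping a residue number to a list of (x, y, z) atoms.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    return sorted(
        res
        for res, atoms in chain_a.items()
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms)
    )


def is_valid_interface(side_a, side_b, min_residues=5):
    """Approximate the size rule: each side needs at least `min_residues` residues."""
    return len(side_a) >= min_residues and len(side_b) >= min_residues


# Toy coordinates: residue 1 of chain A sits 4.0 Å from chain B, residue 2 is far away.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {1: [(0.0, 0.0, 4.0)]}

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(side_a, side_b)                      # [1] [1]
print(is_valid_interface(side_a, side_b))  # False: too few residues on each side
```

The all-against-all distance loop is quadratic in the number of atoms, which is fine for a sketch; a production tool would normally use a spatial index such as a grid or k-d tree.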

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid

position. For JAIL, this information was retrieved from ConSurf, a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or None.

A full-text search and a SCOP text search by keyword are also available.

Results can be restricted to interfaces having the following location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


Recent news

3rd June 2013 genetrainer being launched at LeWeb conference

The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.


NEW UPDATE PROTOCOL

Automatic methods

Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.

Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).

Web pages to support manual validation

For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.


For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
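The hit-selection rule described above (E-value below 0.01 together with a 60% residue overlap with the CATH domain) can be sketched as a simple filter. This is only an illustrative sketch, not the CATH pipeline: the `Hit` record, its field names and the reading of "60% overlap" as "at least 60% of the domain's residues are covered" are assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one HMM hit (field names are assumptions)."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the hit
    domain_length: int     # total residues in the CATH domain


def is_homologue(hit, max_e_value=0.01, min_overlap=0.60):
    """Apply the two thresholds quoted in the text to a single hit."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def select_homologues(hits):
    """Keep only the hits that pass both thresholds."""
    return [hit for hit in hits if is_homologue(hit)]
```

The same function could be reused for the stricter annotation step by passing different thresholds, although that step is based on BLAST sequence identity rather than E-values, so it would need its own field.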

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
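As a rough illustration of how the SSAP score ranges quoted above might be read, the sketch below maps a score to a label. The function name and the label strings are assumptions made here, and the text says the scores "generally" fall in these ranges, so the cut-offs are rules of thumb rather than a definitive classifier.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough categories quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # Same fold; without sequence or functional similarity this may
        # reflect convergent evolution rather than common ancestry.
        return "similar fold (T-level)"
    return "no clear structural relationship"
```

In practice such a label would be combined with sequence identity and functional evidence before any relationship is asserted, as the surrounding text stresses.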

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
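The breakdown of the 443 structures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the number of novel folds is not given explicitly, so it is derived here as the remainder, which is an assumption about how the figures fit together.

```python
total = 443

# Counts quoted in the text for structures resembling a known fold.
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Assumption: whatever is left over corresponds to the novel folds.
counts["novel folds"] = total - sum(counts.values())

# Percentage of the 443 structures in each category, rounded to whole numbers.
percentages = {name: round(100 * n / total) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
```

The remainder works out to 35 structures, which rounds to the 8% of novel folds stated in the text, so the quoted counts and percentages are mutually consistent.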


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
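The test "the first three EC identifiers are the same" can be sketched as a comparison of EC number prefixes. The function names below are hypothetical, and the EC numbers in the usage note are illustrative values rather than data from the CATH analysis.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields.

    An empty family returns False, since there is nothing to compare.
    """
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1
```

For example, a family annotated with 3.4.21.62 and 3.4.21.14 agrees at three levels but not at four, whereas 3.4.21.62 and 3.5.1.1 already differ at the second level. Calling the function with `levels=4` corresponds to the stricter "single EC code" criterion.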

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods which aim to predict function ab initio from structure have already been developed and will increasingly be the focus of attention over the next few years. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

[Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank
(PDB). Nevertheless, it is essential to analyse the interacting parts of proteins in order to
understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; in particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein chains were
additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
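The distance rule above can be illustrated with a small sketch. This is not JAIL's own code: the Atom record, the function names, the inclusive comparison (<=) and the reading of "one part" as "each side of the interface" are assumptions made for illustration, and the cutoff is taken as 4.5 Ångström.

```python
from dataclasses import dataclass
from math import dist

CUTOFF = 4.5        # Angstrom; distance rule quoted in the text
MIN_RESIDUES = 5    # assumed reading: at least 5 residues (C-alpha atoms) per side


@dataclass(frozen=True)
class Atom:
    """Hypothetical minimal atom record used only for this sketch."""
    chain: str    # chain or domain label
    res_id: int   # residue number within the chain
    name: str     # atom name, e.g. "CA"
    xyz: tuple    # (x, y, z) coordinates in Angstrom


def interface_residues(side_a, side_b, cutoff=CUTOFF):
    """Residue ids of side_a with at least one atom within cutoff of any atom of side_b."""
    hits = set()
    for a in side_a:
        if a.res_id in hits:
            continue  # the whole residue is already counted
        if any(dist(a.xyz, b.xyz) <= cutoff for b in side_b):
            hits.add(a.res_id)
    return hits


def is_interface(side_a, side_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """True if both sides contribute at least min_residues residues to the contact."""
    return (len(interface_residues(side_a, side_b, cutoff)) >= min_residues
            and len(interface_residues(side_b, side_a, cutoff)) >= min_residues)
```

The all-against-all loop is quadratic in the number of atoms, which is fine for a single pair of chains; a large-scale calculation would normally use a spatial index instead.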

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interactions (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity
of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.

[Figure: database scheme]

Search for:
  PDB ID (e.g. 1aay)
  SCOP ID (e.g. d1az0a_)
  EC number (e.g. 3.1.1)
  Accession number (e.g. P03697)
  Protein name (e.g. capsid protein)

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic,
Biounit/Biounit, or none.

A full-text keyword search and a SCOP text keyword search are also provided. Results can be
restricted to interfaces having a given location: intra, inter, or don't care.

MMDB

MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the
Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and
computationally identified 3D domains (compact substructures) that are used to identify similar
3D structures, as well as links to literature, similar sequences, information about chemicals
bound to the structures, and more. These connections make it possible, for example, to find 3D
structures for homologs of a protein sequence of interest, then interactively view the
sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features. (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.

For each protein you can:
  Submit sequences for SCOP classification
  View domain organisation, sequence alignments and protein sequence details

For each genome you can:
  Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
  Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
  Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
  abstract and genome assignments
  Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB
or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely
sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and
families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations,
domain architecture co-occurrence networks and domain distribution across taxonomic
kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with
assignment, percentage of sequences with assignment, percentage total sequence coverage,
number of domains assigned, number of superfamilies assigned, number of families assigned,
average superfamily size, percentage produced by duplication, average sequence length,
average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease
Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly
Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or
specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a
domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the
SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a
significant match. Profile comparison (PRC), for aligning and scoring two profile hidden
Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences, and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.


OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)

Assigning domain boundaries and relationships between protein structures is
computationally challenging. Since the last CATH release (version 2.6), the number of domains
in the CATH database has increased by 20% in version 3.0 and now totals 86,151. This is a
more than 10-fold increase in the number of domains classified in CATH since its creation.
Improvements in automation, and also in the web-based resources used to aid manual
validation, have allowed us to increase the proportion of hard-to-classify structures processed
in CATH, and this is reflected in a significant increase in the proportion of new folds in the
database, now more than 1000. The detailed breakdown of numbers of domains in the nine
CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and
have derived statistics for several fundamental features.

Percentage of new topologies

An analysis of the percentage of new folds arising from the early 1970s to the present is shown
in Figure 1. The number of new folds has been decreasing over time with respect to the number
of new structures being deposited, and it can be seen that currently approximately 2% of new
structures classified in CATH are observed to be novel folds. For comparison, the number of
domain structures solved over time is also graphically represented in Figure 1.

Number of domains within a protein chain

Integral to the construction of the CATH database is designating domain boundaries. We
conducted an analysis of the number of chains versus the number of domains in a chain. It is
interesting to note that 64% of all protein structures currently solved and classified in CATH
are single-domain chains (data not shown). The next most prevalent are two-domain chains
(27%), and following this we find that the number of chains containing three or more domains
rapidly decreases. The average size of the single-domain chains is 159 residues.

The CATH Dictionary of Homologous Superfamilies (CATH-DHS)

The CATH-DHS has also been recently updated. Data on structural similarity and superfamily
variability are presented as a significant update to the Dictionary of Homologous Superfamilies
(DHS) web resource (14). The DHS also provides functional annotations of domains within
each H-level (superfamily) in CATH v2.5.1.

For each superfamily, pair-wise structural similarity scores between relatives, measured by
SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459
superfamilies. For each superfamily, multiple alignments are generated for all the relatives
and also for subgroups of structurally similar relatives and sequence-similar relatives.
Alignments are performed using the residue-based CORA algorithm (15) and presented both as
CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the
superposed structures in PDB format. Sequence representations of the alignments are
available to download in FASTA format. In the CORAPLOT images of the multiple alignment,
residues in each domain are coloured according to ligand binding and residue type. EquivSEC
plots are also shown that describe the variability in orientation and packing between
equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were
scanned against HMMs of all CATH domains (12). Homologous sequences were identified as
those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This
protocol recognised over one million domain sequences in UniProt which could be integrated
in the CATH-DHS. The harvested sequences in each superfamily were compared against other
relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at
appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering.
Information and links to other functional databases, ENZYME (19), GO (Gene Ontology
Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing
the sequences from each superfamily against sequences provided by these resources. Only
hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
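The acceptance rule for sequence relatives can be written as a simple filter. The sketch below is illustrative only: the Hit record and its field names are hypothetical, the thresholds are read as E-value < 0.01 and at least 60% of the CATH domain covered, and the overlap is assumed to be measured relative to the domain length.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one HMM hit against a CATH domain model."""
    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the CATH domain covered by the hit
    domain_length: int      # total length of the CATH domain


def is_homologue(hit, max_e_value=0.01, min_overlap=0.60):
    """Accept a hit only if it is both significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def harvest(hits, **thresholds):
    """Return the ids of all hits that pass the filter."""
    return [h.sequence_id for h in hits if is_homologue(h, **thresholds)]
```

The stricter annotation step described at the end of the paragraph (95% identity, 80% overlap) would need a sequence-identity field as well, so it is not covered by this filter.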

Recent analysis of structural and functional divergence in highly populated CATH
superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using
data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of
families and identify highly conserved structural cores and secondary structure
embellishments, or decorations, to the common core. In some large superfamilies, extensive
embellishments were observed outside the core, and although these secondary structure
insertions were frequently discontinuous in the protein chain, they were often co-located in 3D
space (16). In many cases, manual inspection revealed that the embellishment had aggregated
to form a larger structural feature that was modifying the active site of the domain or creating
new surfaces for domain or protein interactions. Data collected in the DHS clearly show a
relationship between structural divergence within a superfamily, sequence divergence of this
superfamily amongst predicted domains in the genomes, and the number of distinct functional
groups that can be identified for the superfamily (see Figure 4).


Figure 4. Relationship between sequence variability, structural variability and functional
diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by
the number of diverse structural subgroups (SSAP score <80 between groups).


LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where
significant changes in the structures had occurred. In some cases 5-fold differences in the
sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very
diverse relatives had effectively changed. Therefore, in these superfamilies more than one
fold group can be identified, effectively breaking the hierarchical nature of the CATH
classification, which implies that each relative within a CATH homologous superfamily should
belong to the same CAT fold group.

In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives
revealed several cases of extremely remote homologues which had been classified into
separate superfamilies and yet match with significant E-values. Structure comparison had
failed to detect the relationship between these superfamilies because the structural divergence
of the relatives was so extreme, sometimes constituting a change in architecture as well as
fold group. In these cases homology was only suggested by the HMM-based scans and then
manually validated by considering functional information and detailed evidence from the
literature. In order to capture information on these distant homologies, links have been created
between the superfamilies, both on our web pages and in the CATH database. The data can
now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural
overlaps between superfamilies or fold groups. For these cases we are not currently able to
find any additional evidence to support a distant evolutionary relationship, and these examples
highlight the recurrence of large structural motifs between some folds and the existence of a
structural continuum in some regions of fold space.


FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface
may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There
is also a facility for keyword searches. With the version 3.0 release we now make the raw and
processed data files available, which include, for example, CATH domain PDB files, sequences
and DSSP files; they can be accessed through the CATH database main page. The Gene3D
resource can be accessed through the CATH database or directly at
http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or
directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution,
reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
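The SSAP score ranges quoted above can be summarised in a small helper. This is only a reading aid: the function name and category labels are invented here, and the text says scores "generally" fall in these ranges, so the cut-offs are rules of thumb and not strict class boundaries.

```python
def ssap_category(score):
    """Map an SSAP score (0 to 100) onto the rough categories quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural relationship"
```

As the paragraph notes, a score in the fold range without supporting sequence or functional similarity may reflect convergent evolution, so the structural score alone does not establish common ancestry.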

Abstract

We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein
domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we
have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of
clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting
genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
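CATH identifiers are conventionally written as four dot-separated numbers, one per level (class.architecture.topology.homologous superfamily), for example 3.40.50.200. The sketch below shows how such an identifier could be split and compared level by level; the class and function names are invented for illustration and are not part of any CATH software.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """The four major CATH levels, in hierarchical order."""
    klass: int                   # C: class ("class" is a reserved word in Python)
    architecture: int            # A
    topology: int                # T: topology or fold group
    homologous_superfamily: int  # H


def parse_cath_id(text):
    """Parse an identifier such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-level CATH identifier: {text!r}")
    return CathId(*(int(p) for p in parts))


def share_level(a, b, depth):
    """True if two identifiers agree on the first `depth` levels (1=C ... 4=H)."""
    return a[:depth] == b[:depth]
```

For instance, two domains whose identifiers agree to depth 3 but not depth 4 share a fold group without being placed in the same homologous family.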

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains,
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that, whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.


Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the
CATH database.


Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous
superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for
non-identical relatives in the family, together with EC identifier codes and information about the
enzyme reactions. The multiple structural alignment shown has been coloured according to
secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised: mainly-α, mainly-β and α–β, since analysis revealed considerable
overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and
checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data,
where available, for each protein within a homologous family. This includes EC identifiers,
SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by
protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).


Figure 3. CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold
families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propellor.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level). Recognising more distant structural similarity, with no
accompanying sequence or function similarity, gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences
and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are how many more folds we need
to determine before we have the complete set, and how confident we can be in assigning function
between proteins having similar structures.

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
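The counts and percentages in the two paragraphs above can be cross-checked with a little arithmetic. The sketch below assumes the percentages are 79%, 45%, 38%, 9% and 8%, and that the novel folds are whatever remains of the 443 structures once the three named groups are subtracted; the variable names are introduced here for illustration.

```python
total_new_domains = 2159   # domains classified June 1997 to March 1998
new_sequences = 443        # domains not clearly homologous to a known structure

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: the novel folds are the remainder of the 443 structures.
counts["novel folds"] = new_sequences - sum(counts.values())

# Share of each group among the 443 "new" sequences, rounded to whole percent.
percentages = {name: round(100 * n / new_sequences) for name, n in counts.items()}

# Share of all 2159 domains that were clearly homologous to known structures.
clearly_homologous_pct = round(
    100 * (total_new_domains - new_sequences) / total_new_domains
)

for name, n in counts.items():
    print(f"{name}: {n} ({percentages[name]}%)")
print(f"clearly homologous to known structures: {clearly_homologous_pct}%")
```

Under that assumption the remainder is 35 structures, and every rounded value agrees with the figures given in the text, which supports reading the stripped numbers as percentages.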


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
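As a rough illustration of what "the first three EC identifiers were the same" means in practice, the sketch below counts how many leading EC fields are shared by all members of a family. The function name and the example EC numbers are illustrative choices, not data taken from CATH.

```python
def shared_ec_levels(ec_numbers):
    """Return how many leading EC fields (0-4) are identical across all given EC numbers."""
    fields_per_enzyme = [ec.split(".") for ec in ec_numbers]
    shared = 0
    for fields in zip(*fields_per_enzyme):
        # Stop at the first level that differs or is unassigned ("-").
        if len(set(fields)) != 1 or fields[0] == "-":
            break
        shared += 1
    return shared

def conserved_to_third_level(ec_numbers):
    """True when every member agrees on at least the first three EC fields."""
    return shared_ec_levels(ec_numbers) >= 3
```

For example, two serine endopeptidases such as 3.4.21.62 and 3.4.21.4 agree on three levels, so a family containing only those would count as conserved in the sense used above.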

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

(i) The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).

(ii) The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

(iii) For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

(iv) Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank (PDB). Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.

To overcome this problem, we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive, non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
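The distance criterion above can be sketched in a few lines. This is a minimal illustration, not JAIL's implementation: the data layout (a mapping from residue id to a list of atom coordinates), the function names and the toy coordinates are assumptions, and the cutoff is read as 4.5 Å. Requiring at least 5 residues stands in for the "at least 5 C-alpha atoms" rule, since each amino acid has one C-alpha.

```python
import math

def interface_residues(side_a, side_b, cutoff=4.5):
    """Residues of side_a with at least one atom within `cutoff` of any atom of side_b.

    side_a, side_b: dict mapping residue id -> list of (x, y, z) atom coordinates.
    """
    atoms_b = [atom for atoms in side_b.values() for atom in atoms]
    selected = []
    for res_id, atoms in side_a.items():
        # The whole residue is taken as soon as one of its atoms is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            selected.append(res_id)
    return selected

def is_interface_side(residues, min_residues=5):
    """One side of a protein interface needs at least `min_residues` residues."""
    return len(residues) >= min_residues

# Toy example: two parallel strands 4.0 apart; chain A has one extra residue at the end.
chain_a = {i: [(3.8 * i, 0.0, 0.0)] for i in range(6)}
chain_b = {i: [(3.8 * i, 4.0, 0.0)] for i in range(5)}
side_a = interface_residues(chain_a, chain_b)
```

The brute-force all-against-all distance check is fine for a sketch; for whole PDB entries a spatial index (for instance a k-d tree) would normally replace the inner loop.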

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence-based clustering uses the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
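To see why a lower identity cutoff gives a more diverse (and smaller) data set, the sketch below applies a greedy, longest-first clustering in the spirit of CD-HIT. It is not CD-HIT itself: the identity function is a toy position-by-position comparison chosen only to keep the example self-contained, and all names and sequences are hypothetical.

```python
def positional_identity(a, b):
    """Toy identity: fraction of equal positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_representatives(sequences, max_identity, identity=positional_identity):
    """Keep a sequence only if it is less than `max_identity` identical to every representative."""
    representatives = []
    # Longest sequences are considered first, so they become cluster representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, rep) < max_identity for rep in representatives):
            representatives.append(seq)
    return representatives

toy_sequences = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
at_95 = greedy_representatives(toy_sequences, 0.95)
at_50 = greedy_representatives(toy_sequences, 0.50)
```

With the 95% setting the two near-identical sequences are both kept; with the 50% setting only one of them survives, so the remaining set is smaller but its members are more dissimilar from each other.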

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Search options

JAIL can be searched by any of the following:

PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit. A full-text keyword search and a SCOP text search are also available, and results can be limited by interface location (intra, inter, or either).

MMDB

MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).

To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
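The two-part acceptance rule for HMM hits (an E-value cutoff plus a minimum residue overlap with the CATH domain) can be sketched as a simple filter. The record layout and names below are hypothetical, and the default E-value cutoff is an assumption: the threshold in this copy of the text lost its decimal point, so it is exposed as a parameter rather than fixed.

```python
from dataclasses import dataclass

@dataclass
class HmmHit:
    """Hypothetical summary of one sequence-versus-HMM match."""
    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the CATH domain covered by the hit
    domain_length: int      # length of the CATH domain

def is_homologous(hit, max_e_value=0.01, min_overlap=0.60):
    """Accept a hit only if it is both significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap

hits = [
    HmmHit("seq_a", 1e-5, 70, 100),   # significant and 70% overlap
    HmmHit("seq_b", 0.5, 90, 100),    # good overlap but not significant
    HmmHit("seq_c", 1e-5, 50, 100),   # significant but only 50% overlap
]
accepted = [h.sequence_id for h in hits if is_homologous(h)]
```

The same shape of filter, with a 95% identity and 80% overlap requirement, would describe the stricter rule used for transferring annotation from the functional databases.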

Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core and, although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
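As a reading aid, the score bands described above can be expressed as a small lookup. The function name and the labels are illustrative, and the bands are only the rules of thumb quoted in the text ("generally" above 80 or above 70); a score on its own does not establish homology.

```python
def ssap_band(score):
    """Map an SSAP score (0-100) to the similarity band suggested in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "typical of the same fold (Topology level)"
    return "no clear structural relationship"
```

A score in the fold band still needs sequence or functional evidence before a common ancestor is inferred, since the text notes that such similarity may reflect convergent evolution.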

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
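The four levels map directly onto the numeric CATH identifiers used for families in these notes. CATH identifiers are conventionally written as four dot-separated numbers, one per level (the dots appear to have been lost in this copy of the text). The sketch below parses that form and counts how many levels two identifiers share; the type and function names are illustrative.

```python
from typing import NamedTuple

class CathId(NamedTuple):
    """One number per level of the hierarchy, from most general to most specific."""
    class_: int
    architecture: int
    topology: int
    homologous_family: int

def parse_cath_id(text):
    """Parse a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))

def shared_levels(a, b):
    """Number of leading levels (0-4) on which two identifiers agree."""
    shared = 0
    for x, y in zip(parse_cath_id(a), parse_cath_id(b)):
        if x != y:
            break
        shared += 1
    return shared
```

Two domains that share three levels have the same class, architecture and fold but belong to different homologous families.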

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionary related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.

Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
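
The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it reads the bracketed figures in the text as percentages, and it derives the number of novel folds by subtraction, a value the text does not state explicitly. All variable names are made up for this example.

```python
# Counts taken from the paragraph above.
total_domains = 2159          # new structural domains, June 1997 to March 1998
total_new = 443               # structures corresponding to 'new' sequences

categories = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

def percent(count, total):
    """Percentage of `total`, rounded to the nearest whole number."""
    return round(100 * count / total)

# Domains recognised by sequence alone (derived by subtraction).
homologous_by_sequence = total_domains - total_new

# Novel folds are whatever is left of the 443 (derived, not stated in the text).
novel_folds = total_new - sum(categories.values())

breakdown = {name: percent(n, total_new) for name, n in categories.items()}
breakdown["novel folds"] = percent(novel_folds, total_new)

print(percent(homologous_by_sequence, total_domains))  # share found by sequence
for name, pct in breakdown.items():
    print(f"{name}: {pct}%")
```

Under these assumptions the derived shares agree with the rounded percentages given in the text (79% by sequence; 45%, 38%, 9% and 8% of the 443).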

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
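
To make the phrase "the first three EC identifiers were the same" concrete, the following sketch counts how many leading fields of the four-field EC number are shared by every member of a family. It is a minimal illustration, not the analysis used for CATH; the function names and the example family are hypothetical.

```python
def ec_levels(ec_number):
    """Split an EC number such as '3.4.21.62' into its four dot-separated fields."""
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields in EC number, got {ec_number!r}")
    return parts

def shared_ec_depth(ec_numbers):
    """Return how many leading EC fields (0 to 4) are identical across all members."""
    rows = [ec_levels(ec) for ec in ec_numbers]
    depth = 0
    for column in zip(*rows):          # compare field 1, then field 2, ...
        if len(set(column)) != 1:
            break
        depth += 1
    return depth

# Illustrative family: the members differ only in the fourth field.
family = ["3.4.21.62", "3.4.21.64", "3.4.21.14"]
print(shared_ec_depth(family))  # 3
```

A family would count towards the ">90%" figure when this depth is at least 3, and as having "a single EC code" when it is 4.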

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.

For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

[Images: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a comfortable tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compiling of comprehensive non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Å of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
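
The distance rule above translates directly into a small routine. The following is a sketch under stated assumptions, not JAIL's actual implementation: the cutoff is read as 4.5 Å, "within" is taken to include the boundary, each chain is represented by a hypothetical dictionary from residue id to atom coordinates, and the "at least 5 C-alpha atoms" rule is approximated as "at least 5 residues" (one C-alpha per residue).

```python
import math

CUTOFF = 4.5  # Angstrom; the distance in the definition above, read as 4.5

def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residue ids of chain_a with at least one atom within `cutoff` of any atom of chain_b.

    Each chain maps a residue id to a list of (x, y, z) atom coordinates,
    a simplified stand-in for parsed PDB records.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    hits = []
    for res_id, atoms in chain_a.items():
        # The whole residue counts as soon as a single atom is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            hits.append(res_id)
    return hits

def is_valid_interface_part(residue_ids, min_residues=5):
    """Approximate the 'at least 5 C-alpha atoms' rule as a minimum residue count."""
    return len(residue_ids) >= min_residues

# Toy data: two straight chains 4.0 Angstrom apart, one atom per residue.
chain_a = {i: [(3.8 * i, 0.0, 0.0)] for i in range(1, 8)}   # residues 1..7
chain_b = {i: [(3.8 * i, 4.0, 0.0)] for i in range(1, 6)}   # residues 1..5

part_a = interface_residues(chain_a, chain_b)
print(part_a, is_valid_interface_part(part_a))
```

In the toy data, residues 1 to 5 of the first chain sit directly opposite a partner atom at 4.0 Å and are selected, while residues 6 and 7 are more than 4.5 Å from every partner atom. The all-against-all comparison is quadratic in the number of atoms; a spatial grid or k-d tree would be the usual choice for real structures.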

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None

Full-text search and SCOP text search, both by keyword

Consider only interfaces having the following location: intra, inter, don't care

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).

LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM

Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred; in some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.

In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).

In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction FOR CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995 information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
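
The four levels map naturally onto the dot-separated CATH identifier, one number per level. The sketch below shows one way to parse such an identifier and to test at which level two domains still agree. It is an illustration only: the type and function names are hypothetical, and the second identifier in the example is chosen purely for demonstration. The class numbering used is the conventional CATH one.

```python
from typing import NamedTuple

# Conventional CATH class numbering.
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}

class CathId(NamedTuple):
    c: int  # Class
    a: int  # Architecture
    t: int  # Topology (fold)
    h: int  # Homologous family

def parse_cath_id(text):
    """Parse an identifier such as '3.40.50.200' into its four levels."""
    fields = text.strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected C.A.T.H with four fields, got {text!r}")
    return CathId(*(int(f) for f in fields))

def same_level(x, y, level):
    """True if two ids agree from the Class down to the given level ('C', 'A', 'T' or 'H')."""
    depth = "CATH".index(level) + 1
    return x[:depth] == y[:depth]

first = parse_cath_id("3.40.50.200")
second = parse_cath_id("3.40.50.300")   # differs only at the H level

print(CLASS_NAMES[first.c])             # alpha-beta
print(same_level(first, second, "T"))   # True: same fold group
print(same_level(first, second, "H"))   # False: different homologous family
```

Comparing prefixes rather than single fields reflects the hierarchy: two domains with the same topology number but different architectures are not in the same fold group.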

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
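
The grouping criteria can be written down as a small decision rule. This is a deliberately simplified sketch and not the CATH update protocol: it assumes that "high structural similarity" means an SSAP score of at least 80 and that a score of at least 70 indicates the same fold, following the SSAP score ranges quoted elsewhere in these notes. It also ignores the functional evidence and manual validation that the real classification relies on. The function name and labels are hypothetical.

```python
def classify_pair(seq_identity, ssap_score):
    """Sort a pair of domains into a CATH-style relationship from two scores.

    seq_identity: percent sequence identity (0 to 100)
    ssap_score:   SSAP structural similarity score (0 to 100)
    """
    if seq_identity >= 35:
        # Significant sequence similarity on its own is enough.
        return "homologous family (sequence)"
    if ssap_score >= 80 and seq_identity >= 20:
        # High structural similarity plus some sequence similarity.
        return "homologous family (structure + sequence)"
    if ssap_score >= 70:
        # Similar fold without clear evidence of common ancestry.
        return "same fold group (T-level)"
    return "no clear relationship"

for identity, ssap in [(40, 60), (25, 85), (10, 75), (10, 50)]:
    print(identity, ssap, "->", classify_pair(identity, ssap))
```

The order of the tests matters: a pair is only considered for the fold level once both homology criteria have failed, which mirrors the distinction drawn in the text between homologues and analogues.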

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).

Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH databasewith a newly determined protein structure and identify possible fold similarities or evolutionaryrelationships There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)

(12) to identify a probable fold for a new sequenceThe latest release of CATH (version 14 April 1998) contains 9342 protein chains from the PDB (13)

which divide into 13 359 domain folds Currently 32 different architectures are recognised Since the

last release three new architectures have been described including the five- bladed α-β propellorGrouping proteins on the basis of sequence structure and functional similarity gives 827 evolutionary

homologous families (H-level) Whilst recognising more distant structural similarity with no

accompanying sequence or function similarity gives rise to 593 different fold groups (T-level)

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown

in Figure 3 It can be seen that several highly populated fold families which we describe as superfolds(6) as they support a diverse range of sequences and more than three different functions still account

for nearly 30 of non-homologous structuresPrevious SectionNext Section

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
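The counts quoted for Figure 4 can be checked with a few lines of arithmetic. The sketch below is only a consistency check on the numbers in the text. It assumes that the novel folds are whatever remains of the 443 structures once the three stated categories are subtracted, and that the 443 are the part of the 2159 new domains that was not clearly homologous; neither figure is given explicitly above.

```python
# Counts quoted in the text for new structural domains (June 1997 to March 1998).
total_domains = 2159
total_new = 443  # structures with 'new' sequences (Fig. 4b)

counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# Assumption: the novel folds are the remainder; the text gives no explicit count.
counts["novel fold"] = total_new - sum(counts.values())

# Percentages of the 443 'new' structures, rounded to whole numbers as in the text.
percentages = {label: round(100 * n / total_new) for label, n in counts.items()}
for label, n in counts.items():
    print(f"{label:20s} {n:4d} {percentages[label]:3d}%")

# Share recognisable as homologues from structure (clear plus probable).
homologue_share = percentages["clear homologue"] + percentages["probable homologue"]

# Assumption: the clearly homologous domains are the rest of the 2159.
clearly_homologous_pct = round(100 * (total_domains - total_new) / total_domains)
print(homologue_share, clearly_homologous_pct)
```

Under these assumptions the four categories add up to the full 443 structures, and the combined share of clear and probable homologues is the sum of the two quoted percentages.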

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
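The test "the first three EC identifiers were the same" can be written as a short comparison of EC number prefixes. This is a minimal sketch, not the procedure used in the CATH analysis; the function names and the example family are illustrative assumptions.

```python
from typing import Iterable, Tuple


def ec_prefix(ec_number: str, levels: int = 3) -> Tuple[str, ...]:
    """Return the first `levels` fields of a four-field EC number."""
    fields = ec_number.strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four dot-separated fields: {ec_number!r}")
    return tuple(fields[:levels])


def shares_ec_prefix(ec_numbers: Iterable[str], levels: int = 3) -> bool:
    """True if every EC number in a non-empty family agrees on the first `levels` fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1


# Illustrative input only; not taken from the CATH analysis.
family = ["3.4.21.62", "3.4.21.64", "3.4.21.66"]
print(shares_ec_prefix(family))            # agreement on the first three fields
print(shares_ec_prefix(family, levels=4))  # the full EC codes differ
```

Comparing at `levels=4` corresponds to the stricter "single EC code" criterion mentioned in the same paragraph.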

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

(i) The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

(ii) The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

(iii) For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

(iv) Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins in order to understand the process of protein-protein docking.

To overcome this problem, we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive, non-redundant data sets for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Å of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
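The distance criterion above can be sketched in a few lines. This is a simplified illustration, not JAIL's implementation. The input layout (a dictionary mapping residue identifiers to lists of atom coordinates), the function names, and the reading of "at least 5 C-alpha atoms" as "at least 5 residues per side" are all assumptions made for the example.

```python
from math import dist

CUTOFF = 4.5          # Å, distance criterion quoted in the text
MIN_RESIDUES = 5      # assumed reading of "at least 5 C-alpha atoms" for protein chains


def interface_residues(side, partner, cutoff=CUTOFF):
    """Residues of `side` with at least one atom within `cutoff` of any atom of `partner`.

    Both arguments map a residue id to a list of (x, y, z) atom coordinates.
    """
    partner_atoms = [atom for atoms in partner.values() for atom in atoms]
    hits = []
    for res_id, atoms in side.items():
        if any(dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            hits.append(res_id)  # the complete residue counts, not just the close atom
    return sorted(hits)


def is_large_enough(residues, min_residues=MIN_RESIDUES):
    """Apply the minimum-size rule to one side of a protein interface."""
    return len(residues) >= min_residues


# Toy coordinates, invented for illustration.
chain_a = {1: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {101: [(4.0, 0.0, 0.0)], 102: [(20.0, 0.0, 0.0)]}

print(interface_residues(chain_a, chain_b))
print(interface_residues(chain_b, chain_a))
```

The all-against-all distance loop is quadratic in the number of atoms, which is fine for a toy example; a real calculation over whole PDB entries would normally use a spatial index.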

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
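The effect of the identity setting can be illustrated with a toy redundancy filter. The sketch below uses a greedy, longest-first selection of representatives, which is the general idea behind CD-HIT-style clustering, but it is not CD-HIT itself. The identity function is a deliberately naive position-by-position comparison without alignment, and the sequences are invented.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_representatives(seqs, max_identity):
    """Keep a sequence only if it is below `max_identity` to every representative kept so far."""
    reps = []
    for s in sorted(seqs, key=len, reverse=True):  # longest first; ties keep input order
        if all(identity(s, r) < max_identity for r in reps):
            reps.append(s)
    return reps


# Invented sequences for illustration.
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFWWWWW", "WWWWWWWWWW"]

loose = greedy_representatives(seqs, max_identity=0.95)
strict = greedy_representatives(seqs, max_identity=0.50)
print(len(loose), len(strict))
```

With the stricter setting fewer sequences survive, and those that remain are more dissimilar to one another, which is the sense in which a 50% limit gives higher diversity than a 95% limit.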

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

[Figure: JAIL database scheme]

Search options

JAIL can be searched by:

PDB-Id (e.g. 1aay)

SCOP-Id (e.g. d1az0a_)

EC number (e.g. 3.1.1)

Accession number (e.g. P03697)

Protein name (e.g. capsid protein)

The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none. A full-text keyword search and a SCOP text search are also available, and results can be limited to interfaces with a given location (intra, inter, or don't care).

MMDB

MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes [1][2][3][4][5]. The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

Recent news

3rd June 2013 genetrainer being launched at LeWeb conference

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

FEATURES

The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.

Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).

Abstract

We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

Introduction for CATH

The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).

CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
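The grouping rule can be written as a small decision function. This is a rough sketch rather than the CATH protocol: it assumes that "high structural similarity" means an SSAP score of at least 80 and that fold-level similarity means an SSAP score of at least 70, the values quoted elsewhere in these notes, and it ignores the functional evidence and manual validation that the real classification also relies on. The function name and return labels are invented for the example.

```python
def classify_pair(seq_identity_pct: float, ssap_score: float) -> str:
    """Suggest a CATH-style relationship between two domains from two numbers."""
    # Significant sequence similarity on its own is enough.
    if seq_identity_pct >= 35:
        return "homologous family"
    # Otherwise require high structural similarity plus some sequence similarity.
    if ssap_score >= 80 and seq_identity_pct >= 20:
        return "homologous family"
    # Structural similarity alone only supports a shared fold (possibly analogous).
    if ssap_score >= 70:
        return "same fold"
    return "no relationship detected"


for identity, ssap in [(40, 60), (25, 85), (10, 85), (10, 75), (10, 50)]:
    print(identity, ssap, classify_pair(identity, ssap))
```

A pair with a high SSAP score but under 20% identity is reported only as sharing a fold, because deciding between a distant homologue and an analogue needs evidence beyond these two scores.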

The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).

Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH databasewith a newly determined protein structure and identify possible fold similarities or evolutionaryrelationships There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)

(12) to identify a probable fold for a new sequenceThe latest release of CATH (version 14 April 1998) contains 9342 protein chains from the PDB (13)

which divide into 13 359 domain folds Currently 32 different architectures are recognised Since the

last release three new architectures have been described including the five- bladed α-β propellorGrouping proteins on the basis of sequence structure and functional similarity gives 827 evolutionary

homologous families (H-level) Whilst recognising more distant structural similarity with no

accompanying sequence or function similarity gives rise to 593 different fold groups (T-level)

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown

in Figure 3 It can be seen that several highly populated fold families which we describe as superfolds(6) as they support a diverse range of sequences and more than three different functions still account

for nearly 30 of non-homologous structuresPrevious SectionNext Section

Implications for Structural Genomics

As the sequence databases grow rapidly the need to interpret these sequences and assign functions to

specific genes becomes increasingly important Many techniques exist for matching protein sequencesand thereby inheriting functional information However for very distant homologues there is often no

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 4857

detectable sequence similarity despite conservation of 3D structure and function For these cases

evolutionary relationships and thereby functions can only be assigned by comparing the structures

Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify

all the folds in nature with the ultimate goal of being able to predict the function of a new protein from

its known or probable structure The important questions to ask are how many more folds do we need

to determine before we have the complete set and how confident can we be in assigning function between proteins having similar structures

In the current genomes on average only 30 ndash 46 of sequences can be assigned to a structural family

by recognising sequence similarity to a protein of known structure (1516) With only sim600 unique

structures currently in the PDB compared with sim20 000 sequence families it is clear that we still

need to determine many more structures if we are to understand biology at the molecular level

However analysis of recently deposited structural data is very revealingFigure 4a illustrates the

distribution of 2159 new structural domains classified in the 10 months from June 1997 to March

1998 A large proportion of these (79) were clearly homologous (ge30 identity) to proteins of known structure

Of the remaining 443 structures (Fig 4b) corresponding to newlsquo sequences we found only 8 were

novel folds the remainder resembling a previously determined structure Many of these 199 (45)

could be identified as clear homologues by having significant structure and sequence similarity (SSAP

ge80 and ge20 sequence identity) A further 169 (38) were probable homologues as although thesequence identity was below 20 they had functional similarity andor gave significant scores using

sequence search methods designed to detect very distant homologues (PSIBLAST) (12) There

remained a further 40 (9) proteins which were analogous mdash ie they had the same fold as a previous

entry but neither the sequence nor the function gave definite evidence of a common ancestor

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 4957

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5057

At the homologous superfamily level in CATH a more detailed analysis of enzyme functions showed

that the majority of homologous enzyme families in CATH (gt90) contained proteins for which the

first three EC identifiers were the same Considering those families where homologues have

significant sequence identity (ge20) after structural alignment 95 were found to have a single EC

identifier whilst for families where proteins have more than 30 sequence similarity we observed

that 98 had a single EC code

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
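As a rough worked reading of these ranges (an interpretation, not a figure given in the text), the share of a whole genome expected to correspond to novel folds is the unmatched fraction multiplied by the novel-fold fraction within it. The function name below is hypothetical.

```python
def novel_fold_share(unmatched_fraction, novel_fraction_of_unmatched):
    """Fraction of all genome sequences expected to be novel folds."""
    return unmatched_fraction * novel_fraction_of_unmatched


# Lower bound: 54% unmatched, 10% of those novel.
low = novel_fold_share(0.54, 0.10)
# Upper bound: 70% unmatched, 20% of those novel.
high = novel_fold_share(0.70, 0.20)
print(f"{100 * low:.1f}% to {100 * high:.1f}% of the genome")
```

On this reading, somewhere between about 5% and 14% of a new genome's sequences would be expected to adopt folds not yet seen.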

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Of course not

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Angstrom of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
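The definition above can be sketched in a few lines. This is a minimal illustration, not JAIL's implementation: the atom representation (residue id, atom name, coordinates), the function names and the brute-force distance search are all assumptions made for the example, and it reads the distance cutoff as 4.5 Angstrom.

```python
import math

CUTOFF = 4.5  # Angstrom, the distance threshold as read from the definition
MIN_CA = 5    # minimum number of C-alpha atoms on one side of an interface


def interface_residues(chain, partner, cutoff=CUTOFF):
    """Residues of `chain` with at least one atom within `cutoff` of `partner`.

    Each atom is a tuple (residue_id, atom_name, (x, y, z)).
    """
    selected = set()
    for res_id, _name, xyz in chain:
        if res_id in selected:
            continue  # the whole residue is already part of the interface
        if any(math.dist(xyz, other) <= cutoff for _r, _n, other in partner):
            selected.add(res_id)
    return selected


def side_is_valid(chain, residues, min_ca=MIN_CA):
    """Check the 'at least 5 C-alpha atoms' rule for one side of an interface."""
    ca_count = sum(1 for res_id, name, _xyz in chain
                   if name == "CA" and res_id in residues)
    return ca_count >= min_ca


# Toy data: six C-alpha atoms in chain A, five atoms of chain B 4.0 A away.
chain_a = [(i, "CA", (3.8 * i, 0.0, 0.0)) for i in range(1, 7)]
chain_b = [(i, "CA", (3.8 * i, 4.0, 0.0)) for i in range(1, 6)]
side_a = interface_residues(chain_a, chain_b)
print(sorted(side_a), side_is_valid(chain_a, side_a))
```

The all-against-all distance loop is quadratic in the number of atoms; a real pipeline would more likely use a spatial grid or k-d tree, but the selection rule stays the same.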

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
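The idea behind sequence-based redundancy removal can be illustrated with a greedy incremental clustering, the general scheme that CD-HIT follows: sequences are processed longest first, and each one either joins the first cluster whose representative is similar enough or becomes a new representative. The sketch below is not CD-HIT itself. In particular, `crude_identity` uses Python's `difflib` as a stand-in for a real alignment-based identity, and all names are hypothetical.

```python
from difflib import SequenceMatcher


def crude_identity(seq_a, seq_b):
    """Stand-in similarity in [0, 1]; NOT a true alignment-based identity."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering; the first member of each cluster is its representative."""
    clusters = []
    # Longest sequences are considered first, so they become representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if crude_identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no similar representative: start a new cluster
    return clusters


seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSSSSPP"]
print(greedy_cluster(seqs, 0.5))
print(greedy_cluster(seqs, 0.95))
```

Keeping one representative per cluster yields the non-redundant set; lowering the threshold merges more sequences into each cluster and leaves fewer, more dissimilar representatives.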

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

A full-text keyword search and a SCOP text keyword search are also available.

Consider only interfaces having the following location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


(Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).

The CATH database of protein structures contains approximately 18,000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.

Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α-β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
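A CATH identifier encodes one number per level of the hierarchy, so two domains can be compared level by level. The sketch below assumes identifiers are written in the dotted form C.A.T.H (for example 3.40.50.200, which is how the subtilisin family id quoted in these notes appears to read once its separators are restored); the helper names and the second identifier in the usage lines are hypothetical.

```python
from typing import NamedTuple

# Class numbers as used by CATH at the top level of the hierarchy.
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_id(text):
    """Split a dotted CATH id such as '3.40.50.200' into its four levels."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {text!r}")
    return CathId(*(int(part) for part in parts))


def same_down_to(a, b, depth):
    """True if two ids agree on the first `depth` levels (1=C, 2=A, 3=T, 4=H)."""
    return a[:depth] == b[:depth]


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.300")  # illustrative second id
print(CLASS_NAMES[subtilisin.class_])
print(same_down_to(subtilisin, other, 3), same_down_to(subtilisin, other, 4))
```

Two domains that agree down to depth 3 share a fold (T-level) without necessarily being homologous; agreement at depth 4 places them in the same homologous superfamily.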

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops/; 10).

Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig 4b) corresponding to newlsquo sequences we found only 8 were

novel folds the remainder resembling a previously determined structure Many of these 199 (45)

could be identified as clear homologues by having significant structure and sequence similarity (SSAP

ge80 and ge20 sequence identity) A further 169 (38) were probable homologues as although thesequence identity was below 20 they had functional similarity andor gave significant scores using

sequence search methods designed to detect very distant homologues (PSIBLAST) (12) There

remained a further 40 (9) proteins which were analogous mdash ie they had the same fold as a previous

entry but neither the sequence nor the function gave definite evidence of a common ancestor

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 4957

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5057

At the homologous superfamily level in CATH a more detailed analysis of enzyme functions showed

that the majority of homologous enzyme families in CATH (gt90) contained proteins for which the

first three EC identifiers were the same Considering those families where homologues have

significant sequence identity (ge20) after structural alignment 95 were found to have a single EC

identifier whilst for families where proteins have more than 30 sequence similarity we observed

that 98 had a single EC code

Although assigning function on the basis of homology is common practice it is clear that some

caution should be exercised particularly where there is little or no sequence similarity There are also

some clear examples where homologues with significant sequence similarity perform different

functions The role of gene recruitmentlsquo is especially clear in the eye lens proteins which function as

enzymes in other cellular environments but which are used as structural proteins in this context (17)

The extent of such gene recruitmentlsquo and context-sensitive function is really not known at this time

For enzymes it is clear that catalytic function can change and evolve usually to act on a different but

related substrate Similarly within the lipocalin family (CATH id 24013010) several proteins arefound with very similar structures which bind different fatty acids in the same region at the base of

the β-barrel (eg retinol bilin biotin)

Nearly half of the homologous families where two or more different EC numbers were observed

belong to the superfolds This suggests that if a new protein is assigned to a superfold family more

caution should be used when inheriting functional information as there appears to be greater toleranceto changes in sequence and ultimately function for these families However it is interesting to note

that many of these were TIM barrel or Rossmann folds These are superfolds in which the substrate or

ligand commonly binds in the same place This is in the base of the β-barrel for the TIMs and at the

crossover of the polypeptide chain for the doubly wound Rossmann structures

Previous SectionNext Section

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignmentof function From our analysis of proteins in CATH we suggest that structural data can help to assign

function in several ways

i The structural data allow recognition of more distant homologues compared withsequence data mdash in our analysis 83 of structures with novel sequences could be assigned ashomologues in this way (note that such assignment of function is again subject to the caveats

imposed by gene recruitmentlsquo discussed above)

ii The structural data allows detailed inspection of the functional site mdash to suggest if and

how the function may have evolved For example if an enzyme has evolved to act on a

different substrate the binding site may reveal or at least suggest possible changes in the

substrate

iii For the superfolds similarity of structure does not necessarily mean similarity of

function However the active sitebinding sites are often conserved eg in the TIM barrel or

Rossmann fold structures the ligand always binds at the same end of the barrel or sheet

iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For

example enzymes can often be identified by the presence of a major cleft which also locates

the active site (18) Similarly critical surface patches which are used for molecular

recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)

In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70

of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5157

will almost certainly reveal some clues to the function For the superfolds some folds will reveal

information on the functional class (eg enzyme for TIM barrels) or the location of the active site if

not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab

initio methods referred to above may provide some clues to guide experiments Therefore it is clear

that determining structures as part of a structural genomicslsquo initiative for example will make a

major contribution to interpreting genome data

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis to analyse the processof molecular recognition JAIL classifies not only the interfaces betweendomain architectures but also those between protein chains and thosebetween proteins and nucleic acids

Of course not

Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT

Interacting proteins are difficult to crystallize and rarely present within the Protein Data Base

Nevertheless it is essential to analyse the interacting parts of the proteins to understand the

process of protein-protein docking

To overcome this problem we have built up the JAIL database Since interacting domains

exhibit similiar structural features than proteins all known interfaces between interacting

domains of the SCOP database were extracted and classified in JAIL

Only a part of all protein structures are included in SCOP Particularly new PDB entries are

not yet annotated To overcome this problem additionally all interfaces between protein

chains were calculated and included in the database This type of interface also comprises

the interacting parts of the assumed biological units The last important type of interfaces

provided here is composed of the interacting parts between proteins and nucleic acids

Overall the data set consists of about 180000 interfaces

JAIL is a comfortable tool to browse through the interface library and to analyze single

interfaces However more general questions require large-scale analysis For this purpose

a detailed form enables the compiling of comprehensive non redundant data sets for

download

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5257

How is an interface defined

A complete residue is part of an interface if at least one atom of the aminoacid is located

within a range of 45 Angstroem of any atom of the interacting domain or chain One part of

an interface must consist of at least 5 C-alpha atoms in the case of protein chains In thecase of nucleic acids the relevant atoms are calculated on the basis of the phosphor atoms

of the RNADNA-backbone

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence-based clustering uses the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification, which classifies proteins by domain architecture.
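To make the idea of sequence-based redundancy removal concrete, here is a minimal sketch of greedy incremental clustering, the general strategy CD-HIT is known for: sequences are visited from longest to shortest, and each one either joins the first representative it is similar enough to or becomes a new representative. The identity measure below (based on `difflib.SequenceMatcher`) is a crude stand-in chosen for brevity, not CD-HIT's algorithm, and the function names and toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude identity: matched characters divided by the shorter length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Map each representative name to the names clustered with it."""
    clusters = {}
    # Longest sequences are considered first and become representatives.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[name], seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


toy = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQK", "c": "GGGGGGGGGG"}
at_50 = greedy_cluster(toy, 0.50)   # "a" and "b" collapse into one cluster
at_95 = greedy_cluster(toy, 0.95)   # "a" and "b" stay separate
```

Keeping one representative per cluster yields fewer, more dissimilar entries at a low threshold than at a high one.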

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme (figure not reproduced)

Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit

Consider only interfaces having the following location: intra, inter, don't care.

A full-text keyword search and a SCOP text search by keyword are also available.
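The identifier types accepted by the search form differ enough in shape that a query can usually be classified by pattern alone. The sketch below illustrates this with regular expressions; it is a hypothetical helper, not how the JAIL form is known to work. The accession pattern follows the published UniProt accession format, while the other patterns are simplified assumptions.

```python
import re

# Ordered from most to least specific; all patterns are illustrative.
PATTERNS = [
    ("scop_id", re.compile(r"d[0-9][a-z0-9]{3}[a-z0-9_.][a-z0-9_]")),
    ("pdb_id", re.compile(r"[0-9][A-Za-z0-9]{3}")),
    ("ec_number", re.compile(r"[1-9](\.(\d+|-)){0,3}")),
    ("accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
    )),
]


def classify_query(query):
    """Return the first identifier type whose pattern matches the whole query."""
    text = query.strip()
    for kind, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return kind
    return "protein_name"  # fall back to a free-text name search


examples = ["1aay", "d1az0a_", "3.1.1", "P03697", "capsid protein"]
kinds = [classify_query(q) for q in examples]
```

The patterns are deliberately strict about length, so free text such as a protein name falls through to the default.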

MMDB

Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features. (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
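Term-based searching of the structure database can also be done programmatically. The sketch below only builds a query URL for NCBI's E-utilities `esearch` endpoint with `db=structure`; it does not send the request. The endpoint is the present-day E-utilities service rather than the address quoted in the text, and the helper name and default result limit are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Present-day E-utilities search endpoint (not the URL given in the notes).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term, max_results=20):
    """Build an esearch URL that looks up `term` in the Entrez structure database."""
    params = {"db": "structure", "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = structure_search_url("capsid protein")
```

Fetching the URL (for example with `urllib.request.urlopen`) returns an XML list of matching record ids; NCBI's usage guidelines on request rates apply.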


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
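Several of the per-genome statistics listed above are simple aggregates over the domain assignments. The following sketch shows how a few of them could be computed. It is an illustration only: the input layout (sequence length plus a list of 1-based, inclusive, non-overlapping domain regions per protein) and all names are assumptions, not the SUPERFAMILY database schema.

```python
def genome_statistics(proteins):
    """Summarise domain assignments for one genome.

    `proteins` is a list of (sequence_length, [(start, end), ...]) tuples;
    regions are assumed 1-based, inclusive and non-overlapping.
    """
    n_sequences = len(proteins)
    n_assigned = sum(1 for _, domains in proteins if domains)
    total_length = sum(length for length, _ in proteins)
    matched = sum(end - start + 1 for _, domains in proteins for start, end in domains)
    return {
        "number_of_sequences": n_sequences,
        "sequences_with_assignment": n_assigned,
        "percent_with_assignment": 100.0 * n_assigned / n_sequences,
        "percent_sequence_coverage": 100.0 * matched / total_length,
        "domains_assigned": sum(len(domains) for _, domains in proteins),
    }


# Toy genome: three proteins, one of them without any domain assignment.
toy_genome = [
    (100, [(1, 50)]),
    (200, [(11, 60), (101, 150)]),
    (100, []),
]
stats = genome_statistics(toy_genome)
```

Here two of three sequences carry an assignment, and 150 of 400 residues are covered by domains.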


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


Figure 1

Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.


Figure 2

Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
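A CATH identifier encodes the hierarchy directly: its four dot-separated numbers give the Class, Architecture, Topology (fold) and Homologous superfamily levels, as in 3.40.50.200 for the subtilisin family. The sketch below splits such an identifier into its levels. The helper and type names are hypothetical, and the mapping of class numbers to names reflects the conventional CATH numbering rather than anything stated in these notes.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    class_: int                  # C level
    architecture: int            # A level
    topology: int                # T level (fold)
    homologous_superfamily: int  # H level


# Conventional CATH class numbering (assumed here, not given in the text).
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


def parse_cath_id(text):
    """Parse a dotted C.A.T.H identifier into its four integer levels."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {text!r}")
    return CathId(*(int(part) for part in parts))


subtilisin = parse_cath_id("3.40.50.200")
subtilisin_class = CLASS_NAMES[subtilisin.class_]
```

Two domains share a fold when their first three levels agree, and are treated as homologous when all four agree.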

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops/; 10).


Figure 3

CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.

However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
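The percentages quoted for the new structural domains can be checked directly from the counts in the text. The short calculation below does so; the only derived figure is the number of novel folds, which is not stated explicitly and is obtained here on the assumption that the four categories partition the 443 new-sequence structures.

```python
total_new_domains = 2159   # domains classified June 1997 to March 1998
new_sequence_domains = 443  # those not clearly homologous by sequence

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: whatever is left over corresponds to the novel folds.
counts["novel folds"] = new_sequence_domains - sum(counts.values())

percent = {
    name: round(100 * n / new_sequence_domains) for name, n in counts.items()
}

# Share of all new domains that were clearly homologous by sequence.
clearly_homologous = round(
    100 * (total_new_domains - new_sequence_domains) / total_new_domains
)

# Share of new-sequence structures assignable as homologues using structure.
homologues_by_structure = round(
    100
    * (counts["clear homologues"] + counts["probable homologues"])
    / new_sequence_domains
)
```

The results reproduce the rounded figures in the text: 45%, 38%, 9% and 8% for the four categories, 79% clearly homologous overall, and 83% of new-sequence structures recognisable as homologues.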


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
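Comparing EC numbers "at the first three identifiers" amounts to comparing prefixes of the four-level EC code. A minimal sketch of such a check is given below; the function names are hypothetical and the EC numbers are illustrative values only, not data taken from the CATH analysis.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec(ec_numbers, levels=3):
    """True if every member of a family agrees on the first `levels` EC fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative family: same reaction class, different fourth field.
family = ["3.4.21.62", "3.4.21.64"]
same_first_three = family_shares_ec(family)           # agree at three levels
same_full_code = family_shares_ec(family, levels=4)   # differ at the fourth
```

Running the same check with `levels=4` distinguishes families that share a broad reaction type from those with a single, fully identical EC code.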

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is not known at this time.

For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that, if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways.

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.

Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database Since interacting domains

exhibit similiar structural features than proteins all known interfaces between interacting

domains of the SCOP database were extracted and classified in JAIL

Only a part of all protein structures are included in SCOP Particularly new PDB entries are

not yet annotated To overcome this problem additionally all interfaces between protein

chains were calculated and included in the database This type of interface also comprises

the interacting parts of the assumed biological units The last important type of interfaces

provided here is composed of the interacting parts between proteins and nucleic acids

Overall the data set consists of about 180000 interfaces

JAIL is a comfortable tool to browse through the interface library and to analyze single

interfaces However more general questions require large-scale analysis For this purpose

a detailed form enables the compiling of comprehensive non redundant data sets for

download

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5257

How is an interface defined

A complete residue is part of an interface if at least one atom of the aminoacid is located

within a range of 45 Angstroem of any atom of the interacting domain or chain One part of

an interface must consist of at least 5 C-alpha atoms in the case of protein chains In thecase of nucleic acids the relevant atoms are calculated on the basis of the phosphor atoms

of the RNADNA-backbone

What are biounits

The primary coordinate file deposited in the PDB generally contains one asymmetric unit

The asymmetric unit is the smallest portion of a crystal structure to which crystallographic

symmetry can be applied to generate one cell The biological molecule (biounit) is believed

to be the functional unit of the protein Frequently those units can be assumed or calculatedwhen additional information is available The biological units of many proteins are deposited

in a separate section of the PDB database and can be used for interface calculations More

information about biounits

In which way are redundant interfaces excluded in the download section

The redundancy is excluded in two different ways by structure and by sequence The

sequential clustering is based on the Cd-hit program The structural clustering is defined by

the protein families and superfamilies of the SCOP classification The database classifies

proteins by domain architecture

Which settings in the download section are best for my own research

The selection of the datasets depends on the type of interactions (protein-protein or protein-

nucleic acids) and the level of diversity that is desired Sequence identity of maximal 50

results in a higher diversity than the setting to 95 The default settings include interfaces of

domain-domain interactions as well as interfaces between interacting chains All interfaces of

chains that were already treated by the SCOP domain interfaces are excluded by default

This procedure results in a high number of interfaces that are still diverse enough for

statistical analysis

What is meant by show conservation in Jmol

The conservation of protein sequences is defined by the mutation rates at each amino acid

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5357

position For JAIL this information was retrieved from ConSurf ConSurf is a derived

database merging structural and sequence information

Database scheme

Search for

PDB-Id eg 1aay

SCOP-Id eg d1az0a_

EC-Number eg 311

Accession number eg P03697

Protein name eg capsid protein

Search in the following interface types

DomainDomain (SCOP)

ChainChain

ProteinNucleic

BiounitBiounit

None

Fulltext search

Keyword

Search

Clear

SCOP text search

Keyword

Search

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5457

Consider only interfaces having the following location

intra inter dont care

Search

Clear

MMDB

Experimentally resolved structures of proteins RNA and DNA derived from the Protein DataBank (PDB) with value-addedfeatures such as explicit chemical graphs computationally

identified 3D domains (compact substructures) that are used to identify similar 3D structures as well as links to literature similar sequences information about chemicals bound to thestructures and more These connections make it possible for example tofind 3D structuresfor homologs of a protein sequence of interest then interactively view the sequence-structurerelationshipsactive sites bound chemicals journal articles and more

Three-dimensional structures are now known within many protein families and it is

quite likely in searching a sequence database that one will encounter a homolog

with known structure The goal of Entrezrsquos 3D-structure database is to make this

information and the functional annotation it can provide easily accessible tomolecular biologists To this end Entrezrsquos search engine provides three powerful

features (i) Sequence and structure neighbors one may select all sequences similar

to one of interest for example and link to any known 3D structures (ii) Links

between databases one may search by term matching in MEDLINE for example and

link to 3D structures reported in these articles (iii) Sequence and structure

visualization identifying a homolog with known structure one may view molecular-

graphic and alignment displays to infer approximate 3D structure In this article we

focus on two features of Entrezrsquos Molecular Modeling Database (MMDB) not

described previously links from individual biopolymer chains within 3D structuresto a systematic taxonomy of organisms represented in molecular databases and

links from individual chains (and compact 3D domains within them) to structure

neighbors other chains (and 3D domains) with similar 3D structure MMDB may be

accessed athttpwwwncbinlmnihgoventrezqueryfcgidb=Structure

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5557

SUPERFAMILY is a database of structural and functional annotation for all

proteins and genomes[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models which

represent structural protein domains at the SCOP superfamilylevel[6]

A superfamily groups

together domains which have an evolutionaryr elationship The annotation is produced by

scanning protein sequences from completely sequenced genomes against the hidden Markov

models

For each protein you can

Submit sequences for SCOP classification

View domain organisation sequence alignments and protein sequence details

For each genome you can

Examine superfamily assignments phylogenetic trees domain organisation lists and

networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can

Inspect SCOP classification functional annotation Gene Ontologyannotation InterPro

abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation models and the database dump are freely available for download to everyone

Contents

[hide]

1 Purpose

2 See also

3 References

4 External links

Purpose[edit source]

SUPERFAMILY classifies amino acid sequences into known structural domains especially

into SCOP superfamilies The superfamilies are groups of proteins which have structural

evidence to support a common evolutionary ancestor but may not have detectable

sequence homology

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5657

Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).

The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α-β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).

Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
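The hierarchy described above is reflected in the CATH identifier itself, which is conventionally written as four dot-separated numbers, one per level (Class, Architecture, Topology/fold, Homologous superfamily). The sketch below shows one way such an identifier might be split and compared. The names `CathId`, `parse_cath_id` and `same_level` are hypothetical helpers, not part of any CATH software, and the class-number mapping follows the usual CATH numbering rather than anything stated in these notes.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """One field per level of the CATH hierarchy (hypothetical helper type)."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


# Usual CATH class numbering; treated here as an assumption.
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


def parse_cath_id(text: str) -> CathId:
    """Split a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-level CATH id: {text!r}")
    return CathId(*(int(p) for p in parts))


def same_level(a: CathId, b: CathId, depth: int) -> bool:
    """True if two ids agree on the first `depth` levels (1=C, 2=A, 3=T, 4=H)."""
    return a[:depth] == b[:depth]
```

Comparing two identifiers with `depth=3` asks whether they share a fold (T-level); `depth=4` asks whether they belong to the same homologous superfamily (H-level).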

A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops/; 10).

Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.

We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.

The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.

Implications for Structural Genomics

As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.

Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP score ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
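The breakdown in this paragraph can be checked with a few lines of arithmetic. The sketch below recomputes the percentages from the counts quoted in the text, reading the bracketed figures as percentages of the 443 structures. The count of novel folds is not given explicitly; it is inferred here as the remainder of the 443 structures, which is an assumption of this example.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
TOTAL_NEW = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Assumption: novel folds are whatever is left once the three quoted
# categories are subtracted from the total.
counts["novel folds"] = TOTAL_NEW - sum(counts.values())

# Percentage of the 443 structures in each category, rounded to whole numbers.
percent = {name: round(100 * n / TOTAL_NEW) for name, n in counts.items()}

# Clear plus probable homologues: the share that can be assigned as homologues.
homologue_percent = percent["clear homologues"] + percent["probable homologues"]

for name, n in counts.items():
    print(f"{name:20s} {n:4d}  {percent[name]:3d}%")
print(f"{'all homologues':20s} {'':4s}  {homologue_percent:3d}%")
```

The rounded values agree with the figures in the text (45%, 38%, 9% and 8%), and the combined homologue share comes to 83%.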

At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
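The phrase "the first three EC identifiers were the same" can be made concrete with a small comparison routine. The following is a minimal sketch, assuming EC numbers are given as four dot-separated fields; `shared_ec_depth` is a hypothetical helper and the EC numbers in the usage note are illustrative inputs, not data from the analysis described here.

```python
from typing import Iterable, Tuple


def ec_levels(ec: str) -> Tuple[str, ...]:
    """Split an EC number such as '3.4.21.62' into its four fields."""
    parts = tuple(ec.strip().split("."))
    if len(parts) != 4:
        raise ValueError(f"expected four EC fields, got {ec!r}")
    return parts


def shared_ec_depth(ec_numbers: Iterable[str]) -> int:
    """Number of leading EC fields that all members of a family have in common."""
    levels = [ec_levels(ec) for ec in ec_numbers]
    depth = 0
    # zip(*levels) walks the fields column by column: all first fields,
    # then all second fields, and so on.
    for column in zip(*levels):
        if len(set(column)) != 1:
            break
        depth += 1
    return depth
```

For example, `shared_ec_depth(["3.4.21.62", "3.4.21.64"])` gives 3: the two enzymes agree on the first three fields and differ only in the last one. A family in which every member has a single EC code would give 4.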

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just another interface library? Of course not.

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.

Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem, we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables comprehensive non-redundant data sets to be compiled for download.

How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
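The distance rule above can be sketched in a few lines. This is an illustrative brute-force version, not JAIL's own code: the `Atom` record, the function names and the inclusive `<=` comparison are assumptions, and "at least 5 C-alpha atoms" is read here as "at least 5 residues", since each amino acid residue contributes one C-alpha atom.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: chain label, residue number,
# atom name and (x, y, z) coordinates in Angstrom.
Atom = namedtuple("Atom", "chain resid name xyz")

CUTOFF = 4.5       # distance threshold in Angstrom, as stated in the text
MIN_RESIDUES = 5   # minimum size of one side of a protein interface


def interface_residues(side_a, side_b, cutoff=CUTOFF):
    """Residue numbers on side_a with at least one atom within `cutoff`
    of any atom on side_b. The whole residue counts once one atom qualifies."""
    hits = set()
    for atom in side_a:
        if atom.resid in hits:
            continue  # residue already accepted via another of its atoms
        if any(math.dist(atom.xyz, other.xyz) <= cutoff for other in side_b):
            hits.add(atom.resid)
    return hits


def is_interface_part(residues, min_residues=MIN_RESIDUES):
    """Apply the minimum-size rule to one side of a candidate interface."""
    return len(residues) >= min_residues
```

The all-against-all distance loop is quadratic in the number of atoms. That is adequate for a single pair of chains, but a spatial grid or k-d tree would normally be used when processing a whole structure library.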

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
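Sequence-based redundancy removal of the kind CD-HIT performs follows a greedy scheme: sequences are processed from longest to shortest, and each one either joins an existing cluster or becomes the representative of a new one. The sketch below illustrates only that idea. It is not CD-HIT: the `identity` function uses Python's `difflib` similarity ratio as a crude stand-in for alignment-based sequence identity, and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity: difflib similarity ratio in [0, 1].
    A real tool would compute identity from a sequence alignment."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_representatives(sequences, threshold):
    """Keep a sequence only if it is less than `threshold` similar to every
    representative already kept; longer sequences are considered first."""
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives
```

A lower threshold merges more sequences into each cluster, so fewer but more dissimilar representatives remain. This is why a 50% setting yields a more diverse set than a 95% setting.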

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

[Figure: JAIL database scheme; image not reproduced.]

Search for:

PDB ID, e.g. 1aay

SCOP ID, e.g. d1az0a_

EC number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types:

Domain/Domain (SCOP)

Chain/Chain

Protein/Nucleic

Biounit/Biounit

None

A full-text search and a SCOP text search by keyword are also available.

Consider only interfaces having the following location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
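Besides the web form mentioned above, Entrez databases can today be queried programmatically through NCBI's E-utilities, which are not described in these notes. As a hedged illustration, the sketch below only builds an ESearch request URL for the Structure database and does not contact the server. The helper name `structure_search_url` and the default `retmax` value are assumptions of this example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that searches the Entrez Structure database
    (MMDB) for `term` and asks for a JSON list of matching record IDs."""
    params = {
        "db": "structure",   # Entrez database name for MMDB records
        "term": term,        # free-text query, e.g. a protein name
        "retmode": "json",
        "retmax": retmax,    # maximum number of IDs to return
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(structure_search_url("subtilisin"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client. NCBI's usage guidelines on request rates and API keys apply to automated access.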

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.

Page 47: Essential Info Notes-1

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 4757

Figure 3

CATH wheel plot showing the population of homologous families in different fold groupsarchitectures and classes The wheel is coloured according to protein class (red mainly-α green

mainly-β yellow αβ blue few secondary structures) The size of the outer wheel represents the

number of homologous families in CATH whilst each band in the outer wheel corresponds to a single

fold family The size of each fold bandlsquo therefore reflects the number of homologous families having

that fold It can be seen that most fold families contain a single homologous family The superfoldfamilies are shown as paler bands containing many homologous families The inner wheel shows the

population of homologous families in the different architectures

We have also recently set up a Web Server (11) which enables the user to scan the CATH databasewith a newly determined protein structure and identify possible fold similarities or evolutionaryrelationships There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)

(12) to identify a probable fold for a new sequenceThe latest release of CATH (version 14 April 1998) contains 9342 protein chains from the PDB (13)

which divide into 13 359 domain folds Currently 32 different architectures are recognised Since the

last release three new architectures have been described including the five- bladed α-β propellorGrouping proteins on the basis of sequence structure and functional similarity gives 827 evolutionary

homologous families (H-level) Whilst recognising more distant structural similarity with no

accompanying sequence or function similarity gives rise to 593 different fold groups (T-level)

The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown

in Figure 3 It can be seen that several highly populated fold families which we describe as superfolds(6) as they support a diverse range of sequences and more than three different functions still account

for nearly 30 of non-homologous structuresPrevious SectionNext Section

Implications for Structural Genomics

As the sequence databases grow rapidly the need to interpret these sequences and assign functions to

specific genes becomes increasingly important Many techniques exist for matching protein sequencesand thereby inheriting functional information However for very distant homologues there is often no

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 4857

detectable sequence similarity despite conservation of 3D structure and function For these cases

evolutionary relationships and thereby functions can only be assigned by comparing the structures

Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify

all the folds in nature with the ultimate goal of being able to predict the function of a new protein from

its known or probable structure The important questions to ask are how many more folds do we need

to determine before we have the complete set and how confident can we be in assigning function between proteins having similar structures

In the current genomes on average only 30 ndash 46 of sequences can be assigned to a structural family

by recognising sequence similarity to a protein of known structure (1516) With only sim600 unique

structures currently in the PDB compared with sim20 000 sequence families it is clear that we still

need to determine many more structures if we are to understand biology at the molecular level

However analysis of recently deposited structural data is very revealingFigure 4a illustrates the

distribution of 2159 new structural domains classified in the 10 months from June 1997 to March

1998 A large proportion of these (79) were clearly homologous (ge30 identity) to proteins of known structure

Of the remaining 443 structures (Fig 4b) corresponding to newlsquo sequences we found only 8 were

novel folds the remainder resembling a previously determined structure Many of these 199 (45)

could be identified as clear homologues by having significant structure and sequence similarity (SSAP

ge80 and ge20 sequence identity) A further 169 (38) were probable homologues as although thesequence identity was below 20 they had functional similarity andor gave significant scores using

sequence search methods designed to detect very distant homologues (PSIBLAST) (12) There

remained a further 40 (9) proteins which were analogous mdash ie they had the same fold as a previous

entry but neither the sequence nor the function gave definite evidence of a common ancestor

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 4957

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5057

At the homologous superfamily level in CATH a more detailed analysis of enzyme functions showed

that the majority of homologous enzyme families in CATH (gt90) contained proteins for which the

first three EC identifiers were the same Considering those families where homologues have

significant sequence identity (ge20) after structural alignment 95 were found to have a single EC

identifier whilst for families where proteins have more than 30 sequence similarity we observed

that 98 had a single EC code

Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There
are also some clear examples where homologues with significant sequence similarity perform
different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins,
which function as enzymes in other cellular environments but are used as structural proteins in
this context (17). The extent of such 'gene recruitment' and context-sensitive function is not
known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually
to act on a different but related substrate. Similarly, within the lipocalin family (CATH id
2.40.130.10) several proteins are found with very similar structures which bind different fatty
acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong
to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater
tolerance to changes in sequence, and ultimately function, for these families. However, it is
interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in
which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the
TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the
assignment of function. From our analysis of proteins in CATH, we suggest that structural data can
help to assign function in several ways:

(i) The structural data allow recognition of more distant homologues compared with sequence data;
in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way
(note that such assignment of function is again subject to the caveats imposed by 'gene
recruitment' discussed above).

(ii) The structural data allow detailed inspection of the functional site, to suggest if and how
the function may have evolved. For example, if an enzyme has evolved to act on a different
substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.

(iii) For the superfolds, similarity of structure does not necessarily mean similarity of function.
However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold
structures the ligand always binds at the same end of the barrel or sheet.

(iv) Some methods have already been developed, and will increasingly be the focus of attention over
the next few years, which aim to predict function ab initio from structure. For example, enzymes
can often be identified by the presence of a major cleft, which also locates the active site (18).
Similarly, critical surface patches which are used for molecular recognition in binding other
proteins or ligands may be identified using knowledge-based approaches (19,20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80–90% to be homologous to a known family using the structural data alone. For the singlet folds
this will almost certainly reveal some clues to the function. For the superfolds, some folds will
reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the
active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For
these, the ab initio methods referred to above may provide some clues to guide experiments.
Therefore it is clear that determining structures, as part of a 'structural genomics' initiative
for example, will make a major contribution to interpreting genome data.
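The extrapolation above can be turned into shares of a whole genome with a little arithmetic. The Python sketch below assumes that the quoted ranges can simply be multiplied end to end (low with low, high with high); that is one reading of the text, not a calculation given in the source.

```python
def scale_range(outer, inner):
    """Multiply two (low, high) fraction ranges to get a (low, high) range."""
    return (outer[0] * inner[0], outer[1] * inner[1])


# Ranges quoted in the text, expressed as fractions.
no_sequence_match = (0.54, 0.70)        # share of genome sequences
homologous_by_structure = (0.80, 0.90)  # share of the above
novel_fold = (0.10, 0.20)               # share of the above

rescued = scale_range(no_sequence_match, homologous_by_structure)
novel = scale_range(no_sequence_match, novel_fold)
print(f"assignable via structure: {rescued[0]:.0%} to {rescued[1]:.0%} of all sequences")
print(f"novel folds: {novel[0]:.0%} to {novel[1]:.0%} of all sequences")
```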

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular
recognition. JAIL classifies not only the interfaces between domain architectures but also those
between protein chains and those between proteins and nucleic acids.

[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.

To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit
structural features similar to those of proteins, all known interfaces between interacting domains
of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP; new PDB entries in particular are not
yet annotated. To overcome this problem, all interfaces between protein chains were additionally
calculated and included in the database. This type of interface also comprises the interacting
parts of the assumed biological units. The last important type of interface provided here is
composed of the interacting parts between proteins and nucleic acids. Overall, the data set
consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces.
However, more general questions require large-scale analysis. For this purpose, a detailed form
enables the compilation of comprehensive non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within
4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one
part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the
relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
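The distance criterion above can be illustrated with a short Python sketch. This is not JAIL's implementation: the data layout (a dict mapping a residue identifier to a list of atom coordinates), the function names, the brute-force pairwise search, and the reading of "one part" as "each side of the interface" are all assumptions made for illustration, and the toy coordinates are invented.

```python
import math

CUTOFF = 4.5       # Ångström, the distance threshold quoted in the text
MIN_RESIDUES = 5   # at least 5 C-alpha atoms, i.e. 5 residues, per side


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residues of chain_a with at least one atom within `cutoff` of chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for res_id, atoms in chain_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.append(res_id)
    return selected


def is_interface(chain_a, chain_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """True if both sides contribute at least `min_residues` residues."""
    side_a = interface_residues(chain_a, chain_b, cutoff)
    side_b = interface_residues(chain_b, chain_a, cutoff)
    return len(side_a) >= min_residues and len(side_b) >= min_residues


# Invented toy coordinates: two parallel strands 4.0 Ångström apart,
# one atom per residue.
chain_a = {i: [(3.8 * i, 0.0, 0.0)] for i in range(6)}
chain_b = {i: [(3.8 * i, 4.0, 0.0)] for i in range(6)}
print(interface_residues(chain_a, chain_b))
print(is_interface(chain_a, chain_b))
```

For real structures the all-against-all distance loop becomes slow; a spatial index such as a k-d tree is the usual replacement, but the selection rule stays the same.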

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The
asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry
can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the
functional unit of the protein. Frequently, those units can be assumed or calculated when
additional information is available. The biological units of many proteins are deposited in a
separate section of the PDB and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering
is based on the CD-HIT program. The structural clustering is defined by the protein families and
superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
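To illustrate how sequence-based redundancy removal works in principle, here is a simplified Python sketch of greedy incremental clustering, the general strategy used by CD-HIT-style tools. It is not the CD-HIT algorithm itself: the identity measure is a naive position-by-position comparison (an assumption for illustration; real tools rely on alignment and word-filter heuristics), and the sequences are invented.

```python
def naive_identity(seq_a, seq_b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    length = min(len(seq_a), len(seq_b))
    if length == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / length


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering; returns the representative sequences.

    Sequences are processed longest first. Each one joins the first cluster
    whose representative it matches at or above `threshold`; otherwise it
    becomes the representative of a new cluster.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(naive_identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Invented sequences, for illustration only.
toy = ["MKVLAAGIVG", "MKVLAAGIVA", "MKVLSSGIPA", "QWERTYPSDF"]
print(len(greedy_cluster(toy, 0.95)))  # strict cutoff: near-duplicates only are merged
print(len(greedy_cluster(toy, 0.50)))  # loose cutoff: more sequences are merged
```

A lower identity cutoff leaves fewer representatives, and those that remain are more dissimilar to one another.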

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of
50% results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already covered by the SCOP domain interfaces are excluded by default. This
procedure results in a high number of interfaces that are still diverse enough for statistical
analysis.

What is meant by 'show conservation' in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position.
For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging
structural and sequence information.

Search options

Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic,
Biounit/Biounit.

Further options: full-text keyword search, SCOP text search, and restricting the results to
interfaces having a given location (intra, inter, or don't care).

MMDB

Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank
(PDB), with value-added features such as explicit chemical graphs, computationally identified 3D
domains (compact substructures) that are used to identify similar 3D structures, as well as links
to literature, similar sequences, information about chemicals bound to the structures, and more.
These connections make it possible, for example, to find 3D structures for homologs of a protein
sequence of interest, then interactively view the sequence-structure relationships, active sites,
bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in
searching a sequence database, that one will encounter a homolog with known structure. The goal of
Entrez's 3D-structure database is to make this information, and the functional annotation it can
provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides
three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links between databases:
one may search by term matching in MEDLINE, for example, and link to 3D structures reported in
these articles. (iii) Sequence and structure visualization: identifying a homolog with known
structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure.
In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic
taxonomy of organisms represented in molecular databases, and links from individual chains (and
compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with
similar 3D structure. MMDB may be accessed at
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes
[1][2][3][4][5]. The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by scanning
protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details

For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome

For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.


Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP
superfamilies. The superfamilies are groups of proteins which have structural evidence to support a
common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.


The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for
the predicted protein sequences in over 400 completed genomes. A superfamily groups together
domains of different families which have a common evolutionary ancestor based on structural,
functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated
set of profile hidden Markov models. All models and structural assignments are available for
browsing and download from http://supfam.org. The web interface includes services such as domain
architectures and alignment details for all protein assignments, searchable domain combinations,
domain occurrence network visualization, detection of over- or under-represented superfamilies for
a given genome by comparison with other genomes, assignment of manually submitted sequences and
keyword searches. In this update we describe the SUPERFAMILY database and outline two major
developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional
annotation. The SUPERFAMILY database can be used for general protein evolution and
superfamily-specific studies, genomic annotation, and structural genomics target suggestion and
assessment.


detectable sequence similarity despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the
structures. Therefore, a number of structural genomics initiatives are being proposed (14) which
aim to identify all the folds in nature, with the ultimate goal of being able to predict the
function of a new protein from its known or probable structure. The important questions to ask are:
how many more folds do we need to determine before we have the complete set, and how confident can
we be in assigning function between proteins having similar structures?

In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates
the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.

Of the remaining 443 structures (Fig 4b) corresponding to newlsquo sequences we found only 8 were

novel folds the remainder resembling a previously determined structure Many of these 199 (45)

could be identified as clear homologues by having significant structure and sequence similarity (SSAP

ge80 and ge20 sequence identity) A further 169 (38) were probable homologues as although thesequence identity was below 20 they had functional similarity andor gave significant scores using

sequence search methods designed to detect very distant homologues (PSIBLAST) (12) There

remained a further 40 (9) proteins which were analogous mdash ie they had the same fold as a previous

entry but neither the sequence nor the function gave definite evidence of a common ancestor

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 4957

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5057

At the homologous superfamily level in CATH a more detailed analysis of enzyme functions showed

that the majority of homologous enzyme families in CATH (gt90) contained proteins for which the

first three EC identifiers were the same Considering those families where homologues have

significant sequence identity (ge20) after structural alignment 95 were found to have a single EC

identifier whilst for families where proteins have more than 30 sequence similarity we observed

that 98 had a single EC code

Although assigning function on the basis of homology is common practice it is clear that some

caution should be exercised particularly where there is little or no sequence similarity There are also

some clear examples where homologues with significant sequence similarity perform different

functions The role of gene recruitmentlsquo is especially clear in the eye lens proteins which function as

enzymes in other cellular environments but which are used as structural proteins in this context (17)

The extent of such gene recruitmentlsquo and context-sensitive function is really not known at this time

For enzymes it is clear that catalytic function can change and evolve usually to act on a different but

related substrate Similarly within the lipocalin family (CATH id 24013010) several proteins arefound with very similar structures which bind different fatty acids in the same region at the base of

the β-barrel (eg retinol bilin biotin)

Nearly half of the homologous families where two or more different EC numbers were observed

belong to the superfolds This suggests that if a new protein is assigned to a superfold family more

caution should be used when inheriting functional information as there appears to be greater toleranceto changes in sequence and ultimately function for these families However it is interesting to note

that many of these were TIM barrel or Rossmann folds These are superfolds in which the substrate or

ligand commonly binds in the same place This is in the base of the β-barrel for the TIMs and at the

crossover of the polypeptide chain for the doubly wound Rossmann structures

Previous SectionNext Section

Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignmentof function From our analysis of proteins in CATH we suggest that structural data can help to assign

function in several ways

i The structural data allow recognition of more distant homologues compared withsequence data mdash in our analysis 83 of structures with novel sequences could be assigned ashomologues in this way (note that such assignment of function is again subject to the caveats

imposed by gene recruitmentlsquo discussed above)

ii The structural data allows detailed inspection of the functional site mdash to suggest if and

how the function may have evolved For example if an enzyme has evolved to act on a

different substrate the binding site may reveal or at least suggest possible changes in the

substrate

iii For the superfolds similarity of structure does not necessarily mean similarity of

function However the active sitebinding sites are often conserved eg in the TIM barrel or

Rossmann fold structures the ligand always binds at the same end of the barrel or sheet

iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For

example enzymes can often be identified by the presence of a major cleft which also locates

the active site (18) Similarly critical surface patches which are used for molecular

recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)

In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70

of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5157

will almost certainly reveal some clues to the function For the superfolds some folds will reveal

information on the functional class (eg enzyme for TIM barrels) or the location of the active site if

not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab

initio methods referred to above may provide some clues to guide experiments Therefore it is clear

that determining structures as part of a structural genomicslsquo initiative for example will make a

major contribution to interpreting genome data

Jail

Just another interface library

Interfaces of macromolecules are a valuable basis to analyse the processof molecular recognition JAIL classifies not only the interfaces betweendomain architectures but also those between protein chains and thosebetween proteins and nucleic acids

Of course not

Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT

Interacting proteins are difficult to crystallize and rarely present within the Protein Data Base

Nevertheless it is essential to analyse the interacting parts of the proteins to understand the

process of protein-protein docking

To overcome this problem we have built up the JAIL database Since interacting domains

exhibit similiar structural features than proteins all known interfaces between interacting

domains of the SCOP database were extracted and classified in JAIL

Only a part of all protein structures are included in SCOP Particularly new PDB entries are

not yet annotated To overcome this problem additionally all interfaces between protein

chains were calculated and included in the database This type of interface also comprises

the interacting parts of the assumed biological units The last important type of interfaces

provided here is composed of the interacting parts between proteins and nucleic acids

Overall the data set consists of about 180000 interfaces

JAIL is a comfortable tool to browse through the interface library and to analyze single

interfaces However more general questions require large-scale analysis For this purpose

a detailed form enables the compiling of comprehensive non redundant data sets for

download

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5257

How is an interface defined

A complete residue is part of an interface if at least one atom of the aminoacid is located

within a range of 45 Angstroem of any atom of the interacting domain or chain One part of

an interface must consist of at least 5 C-alpha atoms in the case of protein chains In thecase of nucleic acids the relevant atoms are calculated on the basis of the phosphor atoms

of the RNADNA-backbone

What are biounits

The primary coordinate file deposited in the PDB generally contains one asymmetric unit

The asymmetric unit is the smallest portion of a crystal structure to which crystallographic

symmetry can be applied to generate one cell The biological molecule (biounit) is believed

to be the functional unit of the protein Frequently those units can be assumed or calculatedwhen additional information is available The biological units of many proteins are deposited

in a separate section of the PDB database and can be used for interface calculations More

information about biounits

In which way are redundant interfaces excluded in the download section

The redundancy is excluded in two different ways by structure and by sequence The

sequential clustering is based on the Cd-hit program The structural clustering is defined by

the protein families and superfamilies of the SCOP classification The database classifies

proteins by domain architecture

Which settings in the download section are best for my own research

The selection of the datasets depends on the type of interactions (protein-protein or protein-

nucleic acids) and the level of diversity that is desired Sequence identity of maximal 50

results in a higher diversity than the setting to 95 The default settings include interfaces of

domain-domain interactions as well as interfaces between interacting chains All interfaces of

chains that were already treated by the SCOP domain interfaces are excluded by default

This procedure results in a high number of interfaces that are still diverse enough for

statistical analysis

What is meant by show conservation in Jmol

The conservation of protein sequences is defined by the mutation rates at each amino acid

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5357

position For JAIL this information was retrieved from ConSurf ConSurf is a derived

database merging structural and sequence information

Database scheme

Search for

PDB-Id eg 1aay

SCOP-Id eg d1az0a_

EC-Number eg 311

Accession number eg P03697

Protein name eg capsid protein

Search in the following interface types

DomainDomain (SCOP)

ChainChain

ProteinNucleic

BiounitBiounit

None

Fulltext search

Keyword

Search

Clear

SCOP text search

Keyword

Search

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5457

Consider only interfaces having the following location

intra inter dont care

Search

Clear

MMDB

Experimentally resolved structures of proteins RNA and DNA derived from the Protein DataBank (PDB) with value-addedfeatures such as explicit chemical graphs computationally

identified 3D domains (compact substructures) that are used to identify similar 3D structures as well as links to literature similar sequences information about chemicals bound to thestructures and more These connections make it possible for example tofind 3D structuresfor homologs of a protein sequence of interest then interactively view the sequence-structurerelationshipsactive sites bound chemicals journal articles and more

Three-dimensional structures are now known within many protein families and it is

quite likely in searching a sequence database that one will encounter a homolog

with known structure The goal of Entrezrsquos 3D-structure database is to make this

information and the functional annotation it can provide easily accessible tomolecular biologists To this end Entrezrsquos search engine provides three powerful

features (i) Sequence and structure neighbors one may select all sequences similar

to one of interest for example and link to any known 3D structures (ii) Links

between databases one may search by term matching in MEDLINE for example and

link to 3D structures reported in these articles (iii) Sequence and structure

visualization identifying a homolog with known structure one may view molecular-

graphic and alignment displays to infer approximate 3D structure In this article we

focus on two features of Entrezrsquos Molecular Modeling Database (MMDB) not

described previously links from individual biopolymer chains within 3D structuresto a systematic taxonomy of organisms represented in molecular databases and

links from individual chains (and compact 3D domains within them) to structure

neighbors other chains (and 3D domains) with similar 3D structure MMDB may be

accessed athttpwwwncbinlmnihgoventrezqueryfcgidb=Structure

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5557

SUPERFAMILY is a database of structural and functional annotation for all

proteins and genomes[1][2][3][4][5]

The SUPERFAMILY annotation is based on a collection of hidden Markov models which

represent structural protein domains at the SCOP superfamilylevel[6]

A superfamily groups

together domains which have an evolutionaryr elationship The annotation is produced by

scanning protein sequences from completely sequenced genomes against the hidden Markov

models

For each protein you can

Submit sequences for SCOP classification

View domain organisation sequence alignments and protein sequence details

For each genome you can

Examine superfamily assignments phylogenetic trees domain organisation lists and

networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can

Inspect SCOP classification functional annotation Gene Ontologyannotation InterPro

abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation models and the database dump are freely available for download to everyone

Contents

[hide]

1 Purpose

2 See also

3 References

4 External links

Purpose[edit source]

SUPERFAMILY classifies amino acid sequences into known structural domains especially

into SCOP superfamilies The superfamilies are groups of proteins which have structural

evidence to support a common evolutionary ancestor but may not have detectable

sequence homology

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5657

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
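To make the genome statistics listed above concrete, the following is a minimal sketch of how a few of them could be computed from per-protein domain assignments. The input layout, the function name `genome_statistics` and the superfamily identifiers are assumptions made for illustration; this is not SUPERFAMILY's own code or data format, and the coverage figure assumes non-overlapping domains.

```python
def genome_statistics(proteins):
    """Summarise domain assignments for one genome.

    proteins: dict mapping a sequence id to (length, domains), where domains
    is a list of (start, end, superfamily_id) with 1-based inclusive,
    non-overlapping coordinates. This layout is hypothetical.
    """
    n_seqs = len(proteins)
    n_assigned = sum(1 for _, domains in proteins.values() if domains)
    total_length = sum(length for length, _ in proteins.values())
    covered = sum(end - start + 1
                  for _, domains in proteins.values()
                  for start, end, _ in domains)
    superfamilies = {sf for _, domains in proteins.values() for _, _, sf in domains}
    n_domains = sum(len(domains) for _, domains in proteins.values())
    return {
        "sequences": n_seqs,
        "sequences_with_assignment": n_assigned,
        "percent_with_assignment": 100.0 * n_assigned / n_seqs,
        "percent_sequence_coverage": 100.0 * covered / total_length,
        "domains_assigned": n_domains,
        "superfamilies_assigned": len(superfamilies),
    }


# Made-up toy genome with four proteins; "sf_a" and "sf_b" are placeholder ids.
proteins = {
    "p1": (200, [(1, 100, "sf_a"), (120, 180, "sf_b")]),
    "p2": (100, []),
    "p3": (100, [(11, 50, "sf_a")]),
    "p4": (100, []),
}
stats = genome_statistics(proteins)
```

Here two of the four toy proteins carry an assignment, and 201 of 500 residues are covered by a domain.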


Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.



The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity, we observed that 98% had a single EC code.
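The comparison described above, checking whether all members of a family share the first three EC identifiers, can be sketched as follows. This is only an illustration: the family names and their groupings are invented, and the function names are hypothetical rather than part of any CATH tool.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.1.1.7'."""
    return tuple(ec_number.split(".")[:levels])


def family_is_consistent(ec_numbers, levels=3):
    """True if every EC number in the family shares the same first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) <= 1


def fraction_consistent(families, levels=3):
    """Fraction of families whose members agree on the first `levels` EC fields."""
    consistent = sum(family_is_consistent(ecs, levels) for ecs in families.values())
    return consistent / len(families)


# Invented families for illustration; the groupings are not taken from CATH.
families = {
    "family_A": ["3.1.1.7", "3.1.1.8", "3.1.1.1"],  # same first three fields
    "family_B": ["1.1.1.1", "2.7.1.1"],             # differ at the first field
    "family_C": ["4.2.1.1"],                        # single member
}
share = fraction_consistent(families)
```

Setting `levels=4` would instead test for a single full EC code per family, which is the stricter criterion quoted for the higher sequence-identity thresholds.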

Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).

Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.


Assignment of Function Through Structure

One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:

i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).

ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.

iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.

iv. Some methods which aim to predict function ab initio from structure have already been developed and will increasingly be the focus of attention over the next few years. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19, 20).

In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.


Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.

Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank (PDB). Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.

To overcome this problem, we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only some protein structures are included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here consists of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive, non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. Each part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
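The distance criterion above can be sketched in a few lines. The example below is a simplified illustration, not JAIL's implementation: chains are represented as plain dictionaries of atom coordinates (a made-up layout), the search is a brute-force comparison of all atom pairs, and the 5 C-alpha requirement is approximated as at least 5 selected residues on each side.

```python
import math

CUTOFF = 4.5      # distance threshold in Ångström, as quoted above
MIN_RESIDUES = 5  # minimum number of residues (C-alpha atoms) per interface part


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return ids of residues in chain_a that have at least one atom
    within `cutoff` of any atom in chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for res_id, atoms in chain_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.append(res_id)
    return selected


def is_interface(chain_a, chain_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """True if both sides contribute at least `min_residues` residues."""
    part_a = interface_residues(chain_a, chain_b, cutoff)
    part_b = interface_residues(chain_b, chain_a, cutoff)
    return len(part_a) >= min_residues and len(part_b) >= min_residues


# Toy coordinates: two parallel strands 4.0 Å apart, one atom per residue,
# plus one residue in chain A that lies far from chain B.
chain_a = {i: [(0.0, 3.8 * i, 0.0)] for i in range(6)}
chain_a[6] = [(0.0, 100.0, 0.0)]
chain_b = {i: [(4.0, 3.8 * i, 0.0)] for i in range(6)}
far_chain = {i: [(10.0, 3.8 * i, 0.0)] for i in range(6)}

contacts = interface_residues(chain_a, chain_b)
```

For real structures a spatial index (for example a k-d tree) would avoid the quadratic all-pairs comparison; the brute-force loop is kept here only for clarity.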

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed, or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the Cd-hit program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
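The effect of the identity threshold can be illustrated with a toy greedy filter. This sketch is only loosely in the spirit of sequence clustering tools such as Cd-hit and does not reproduce that program: the `identity` function is a naive position-by-position comparison without alignment, and the sequences are invented.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions over the shorter sequence.

    Toy measure: compares positions directly, with no alignment step.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / n


def greedy_representatives(sequences, max_identity):
    """Keep a sequence only if its identity to every representative kept
    so far is at most `max_identity`. Longer sequences are processed first.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, rep) <= max_identity for rep in representatives):
            representatives.append(seq)
    return representatives


# Invented sequences: the second differs from the first at one position,
# the third shares only its first half with the first.
sequences = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFWWWWW", "MNPQRSTVWY"]

strict = greedy_representatives(sequences, max_identity=0.50)
loose = greedy_representatives(sequences, max_identity=0.95)
```

With the 50% setting the near-duplicate second sequence is dropped, leaving a smaller but more diverse set; with the 95% setting all four sequences are retained.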

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

PDB-Id, e.g. 1aay

SCOP-Id, e.g. d1az0a_

EC-Number, e.g. 3.1.1

Accession number, e.g. P03697

Protein name, e.g. capsid protein

Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none.

A full-text keyword search and a SCOP text search are also available. Results can be restricted to interfaces having a particular location: intra, inter, or don't care.

MMDB

Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.

Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure



SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

- Submit sequences for SCOP classification
- View domain organisation, sequence alignments and protein sequence details

For each genome you can:

- Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
- Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

- Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
- Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.


Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies, and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
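Several of the per-genome statistics listed among the major features are simple ratios over the domain assignments. The sketch below shows how a few of them could be computed. It is not SUPERFAMILY's own code: the input layout (a list of dicts with a "length" and a list of (superfamily, start, end) tuples) and the function name genome_statistics are hypothetical, and the coverage figure assumes that assigned domains on one protein do not overlap.

```python
def genome_statistics(proteins):
    """Summarise domain assignments for one genome.

    proteins: list of dicts (hypothetical layout), each with
      "length":  sequence length in residues
      "domains": list of (superfamily_id, start, end), 1-based inclusive
    """
    if not proteins:
        raise ValueError("genome has no protein sequences")

    n_seqs = len(proteins)
    n_assigned = sum(1 for p in proteins if p["domains"])
    total_len = sum(p["length"] for p in proteins)

    # Residues covered by at least one domain (assumes non-overlapping domains).
    matched = sum(end - start + 1
                  for p in proteins
                  for (_sf, start, end) in p["domains"])

    superfamilies = {sf for p in proteins for (sf, _s, _e) in p["domains"]}

    return {
        "number_of_sequences": n_seqs,
        "sequences_with_assignment": n_assigned,
        "percent_sequences_with_assignment": 100 * n_assigned / n_seqs,
        "percent_total_sequence_coverage": 100 * matched / total_len,
        "domains_assigned": sum(len(p["domains"]) for p in proteins),
        "superfamilies_assigned": len(superfamilies),
        "average_sequence_length": total_len / n_seqs,
    }


# Toy genome of three proteins (made-up data for illustration only).
toy_genome = [
    {"length": 100, "domains": [("sfA", 1, 50)]},
    {"length": 200, "domains": []},
    {"length": 100, "domains": [("sfA", 1, 40), ("sfB", 51, 100)]},
]
stats = genome_statistics(toy_genome)
```

Here two of the three toy proteins carry an assignment, and 140 of 400 residues are covered by a domain.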

Recent news: 3rd June 2013, genetrainer being launched at the LeWeb conference.

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.


will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.

JAIL

Just Another Interface Library

Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.

[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT]

Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.

To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.

Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.

JAIL is a comfortable tool for browsing through the interface library and for analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive, non-redundant data sets for download.


How is an interface defined?

A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
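The distance rule above can be sketched in a few lines. This is an illustration under stated assumptions, not JAIL's implementation: the Atom tuple, the function names and the toy coordinates are hypothetical, the 4.5 Å cutoff and the minimum of 5 C-alpha atoms are taken from the definition as read here, and "at least 5 C-alpha atoms" is interpreted as at least 5 interface residues on each side (one C-alpha per residue). The nucleic-acid case is not covered.

```python
import math
from collections import namedtuple

# Minimal atom record: chain ID, residue number, atom name, coordinates in Å.
Atom = namedtuple("Atom", "chain resnum name x y z")

CUTOFF = 4.5   # distance cutoff in Å, from the definition above
MIN_CA = 5     # minimum number of C-alpha atoms (residues) per interface side


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF):
    """Residues of side A with at least one atom within `cutoff` of any atom of side B."""
    hits = set()
    for a in atoms_a:
        key = (a.chain, a.resnum)
        if key in hits:
            continue  # the whole residue already counts as interface
        for b in atoms_b:
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff:
                hits.add(key)
                break
    return hits


def is_interface(atoms_a, atoms_b, cutoff=CUTOFF, min_ca=MIN_CA):
    """True if both sides contribute at least `min_ca` interface residues."""
    side_a = interface_residues(atoms_a, atoms_b, cutoff)
    side_b = interface_residues(atoms_b, atoms_a, cutoff)
    return len(side_a) >= min_ca and len(side_b) >= min_ca


# Toy example: two straight C-alpha traces, 4.0 Å apart, six residues each.
chain_a = [Atom("A", i, "CA", 3.8 * i, 0.0, 0.0) for i in range(6)]
chain_b = [Atom("B", i, "CA", 3.8 * i, 4.0, 0.0) for i in range(6)]
far_b = [Atom("B", i, "CA", 3.8 * i, 10.0, 0.0) for i in range(6)]
```

The all-against-all loop is quadratic in the number of atoms; for whole PDB entries a spatial index such as a grid or k-d tree would be the usual replacement.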

What are biounits?

The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.

In which way are redundant interfaces excluded in the download section?

Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.

Which settings in the download section are best for my own research?

The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.

What is meant by "show conservation" in Jmol?

The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.

Database scheme

Search for:

- PDB-Id, e.g. 1aay
- SCOP-Id, e.g. d1az0a_
- EC-Number, e.g. 3.1.1
- Accession number, e.g. P03697
- Protein name, e.g. capsid protein

Search in the following interface types:

- Domain/Domain (SCOP)
- Chain/Chain
- Protein/Nucleic
- Biounit/Biounit
- None

Full-text search and SCOP text search are both available by keyword. A search can be restricted to interfaces having a given location: intra, inter, or don't care.

Page 53: Essential Info Notes-1

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5357

position For JAIL this information was retrieved from ConSurf ConSurf is a derived

database merging structural and sequence information

Database scheme

Search for

PDB-Id eg 1aay

SCOP-Id eg d1az0a_

EC-Number eg 311

Accession number eg P03697

Protein name eg capsid protein

Search in the following interface types

DomainDomain (SCOP)

ChainChain

ProteinNucleic

BiounitBiounit

None

Fulltext search

Keyword

Search

Clear

SCOP text search

Keyword

Search

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5457

Consider only interfaces having the following location

intra inter dont care

Search

Clear

MMDB

Experimentally resolved structures of proteins RNA and DNA derived from the Protein DataBank (PDB) with value-addedfeatures such as explicit chemical graphs computationally

identified 3D domains (compact substructures) that are used to identify similar 3D structures as well as links to literature similar sequences information about chemicals bound to thestructures and more These connections make it possible for example tofind 3D structuresfor homologs of a protein sequence of interest then interactively view the sequence-structurerelationshipsactive sites bound chemicals journal articles and more

Three-dimensional structures are now known within many protein families and it is

quite likely in searching a sequence database that one will encounter a homolog

with known structure The goal of Entrezrsquos 3D-structure database is to make this

information and the functional annotation it can provide easily accessible tomolecular biologists To this end Entrezrsquos search engine provides three powerful

features (i) Sequence and structure neighbors one may select all sequences similar

to one of interest for example and link to any known 3D structures (ii) Links

between databases one may search by term matching in MEDLINE for example and

link to 3D structures reported in these articles (iii) Sequence and structure

visualization identifying a homolog with known structure one may view molecular-

graphic and alignment displays to infer approximate 3D structure In this article we

focus on two features of Entrezrsquos Molecular Modeling Database (MMDB) not

described previously links from individual biopolymer chains within 3D structuresto a systematic taxonomy of organisms represented in molecular databases and

links from individual chains (and compact 3D domains within them) to structure

neighbors other chains (and 3D domains) with similar 3D structure MMDB may be

accessed athttpwwwncbinlmnihgoventrezqueryfcgidb=Structure

7272019 Essential Info Notes-1

httpslidepdfcomreaderfullessential-info-notes-1 5557

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
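The scanning step can be illustrated with a deliberately simplified sketch. The real SUPERFAMILY models are profile hidden Markov models with match, insert and delete states; the toy below keeps only position-specific match scores and allows no gaps, so it is not the SUPERFAMILY method itself. The profile values, the uniform background and the function names are all invented for illustration.

```python
import math

BACKGROUND = 1.0 / 20.0  # assumed uniform background over 20 amino acids
PSEUDO = 0.01            # assumed probability for residues not listed in a column

# Hypothetical 3-column profile: each dict maps a residue to its emission
# probability at that position. Real models are far longer and also carry
# insert and delete states, which this sketch omits.
TOY_PROFILE = [
    {"G": 0.8, "A": 0.1},
    {"K": 0.7, "R": 0.2},
    {"S": 0.6, "T": 0.3},
]


def window_score(profile, window):
    """Sum of log-odds scores (in bits) for an ungapped placement."""
    return sum(
        math.log2(column.get(residue, PSEUDO) / BACKGROUND)
        for column, residue in zip(profile, window)
    )


def scan(profile, sequence):
    """Return (best_start, best_score) over all ungapped placements."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    scored = [
        (window_score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    best_score, best_start = max(scored)
    return best_start, best_score


start, score = scan(TOY_PROFILE, "MAGKSLL")
print(start, round(score, 2))
```

In this toy input the window starting at index 2 ("GKS") matches the most probable residue in every column, so it receives the highest score. A real scan would additionally compare the score against a significance threshold before reporting a domain.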

For each protein you can:

Submit sequences for SCOP classification

View domain organisation, sequence alignments and protein sequence details

For each genome you can:

Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks

Check for over- and under-represented superfamilies within a genome

For each superfamily you can:

Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments

Explore taxonomic distribution of a superfamily across the tree of life

All annotation, models and the database dump are freely available for download to everyone.

Purpose

SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Major Features

Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.

Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.

Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.

Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.

Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.

Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.

Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.

Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.

Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.

Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.

Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.

Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.

Web services: Distributed Annotation Server and linking to SUPERFAMILY.

Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
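Several of the genome statistics listed above are simple counts and ratios. The following is a minimal sketch under stated assumptions: the data layout (sequence lengths keyed by ID, and assignments as tuples of sequence ID, superfamily, start and end with 1-based inclusive coordinates) is hypothetical, and "percentage total sequence coverage" is interpreted here as the share of all residues that fall inside at least one assigned domain. None of the names come from SUPERFAMILY itself.

```python
def genome_statistics(seq_lengths, assignments):
    """Compute a few per-genome summary statistics.

    seq_lengths: dict mapping sequence ID -> length in residues
    assignments: list of (seq_id, superfamily, start, end), 1-based inclusive
    """
    n_sequences = len(seq_lengths)
    if n_sequences == 0:
        raise ValueError("genome has no sequences")

    assigned_ids = {seq_id for seq_id, _sf, _start, _end in assignments}

    # Track covered positions per sequence as sets so that overlapping
    # domains are not counted twice.
    covered = {}
    for seq_id, _sf, start, end in assignments:
        covered.setdefault(seq_id, set()).update(range(start, end + 1))

    total_length = sum(seq_lengths.values())
    covered_length = sum(len(positions) for positions in covered.values())

    return {
        "sequences": n_sequences,
        "sequences_with_assignment": len(assigned_ids),
        "percent_sequences_with_assignment": 100.0 * len(assigned_ids) / n_sequences,
        "percent_sequence_coverage": 100.0 * covered_length / total_length,
        "domains_assigned": len(assignments),
        "superfamilies_assigned": len({sf for _id, sf, _s, _e in assignments}),
        "average_sequence_length": total_length / n_sequences,
    }


# Invented toy genome: three proteins, three domain assignments.
toy_lengths = {"p1": 100, "p2": 200, "p3": 100}
toy_assignments = [
    ("p1", "sfA", 1, 50),
    ("p2", "sfA", 11, 60),
    ("p2", "sfB", 101, 200),
]
stats = genome_statistics(toy_lengths, toy_assignments)
print(stats)
```

In the toy genome two of three proteins carry an assignment, and 200 of 400 residues are covered by a domain.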

The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
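The text does not say which statistic SUPERFAMILY uses to detect over- or under-represented superfamilies. As one possible illustration only, the sketch below swaps in a plain z-score: each superfamily's share of all assigned domains in the target genome is compared with the mean and sample standard deviation of that share across the other genomes. The function name, the input layout (dicts of superfamily to domain count) and the toy counts are assumptions.

```python
from statistics import mean, stdev


def representation_zscores(target_counts, other_genomes):
    """Return {superfamily: z-score} for the target genome.

    target_counts: dict mapping superfamily -> number of assigned domains
    other_genomes: list of such dicts, one per comparison genome (at least 2)
    """
    if len(other_genomes) < 2:
        raise ValueError("need at least two comparison genomes")

    def shares(counts):
        total = sum(counts.values())
        if total == 0:
            raise ValueError("genome has no assigned domains")
        return {sf: n / total for sf, n in counts.items()}

    target = shares(target_counts)
    others = [shares(counts) for counts in other_genomes]

    zscores = {}
    for sf in set(target).union(*others):
        reference = [genome.get(sf, 0.0) for genome in others]
        spread = stdev(reference)
        if spread == 0:
            continue  # no variation among comparison genomes; z-score undefined
        zscores[sf] = (target.get(sf, 0.0) - mean(reference)) / spread
    return zscores


# Invented counts: sfA makes up 60% of the target's domains but only
# 20-30% of the domains in the comparison genomes.
target_genome = {"sfA": 60, "sfB": 40}
comparison = [
    {"sfA": 20, "sfB": 80},
    {"sfA": 30, "sfB": 70},
    {"sfA": 25, "sfB": 75},
]
z = representation_zscores(target_genome, comparison)
print(z)
```

A large positive value flags a superfamily as over-represented in the target genome and a large negative value as under-represented; where to draw the cut-off is a separate choice that this sketch leaves open.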
